ABSTRACT

A  non-technical  discussion  and  the  general  technical  formulation 
of  a  statistical  decision  problem  are  given.     Following  this,   statistical 
decision  theory  is  used  to  solve  a  testing  problem  concerning  a  proto- 
type midget  submarine.     A  set  of  rules  to  be  followed  in  conducting  the 
testing  and  reaching  an  optimum  decision  as  to  whether  to  accept  the 
midget is developed. The development proceeds according to the
Bayes  solution  of  a  statistical  decision  problem  in  which  the  stochastic 
variables  are  independently  and  identically  distributed  and  limited  to 
take  only  two  values.     Finally,  brief  discussions  of  the  assumptions 
and  restrictions  of  statistical  decision  theory  and  the  role  of  the  Mini- 
max  solution  are  included. 
PREFACE

Operations research in the Navy is concerned with the establishment
of a quantitative basis for command decision. To help achieve this,
the  naval  operations  analyst  is  constantly  seeking  more  useful  tools. 
One  such  tool  is  the  new  theory  of  statistical  decision  functions,  which, 
though presently unexploited in application, holds promise of extensive
future  use. 

The  theory  of  statistical  decision  functions  was  developed  in  the 
decade  prior  to  1950  by  the  late  Abraham  Wald.     The  development  cul- 
minated in  the  publication  of  his  definitive  book  Statistical  Decision 
Functions. The book was written for mathematicians, and is too
cryptic  for  the  reader  of  limited  mathematical  background.     This  fact, 
along  with  the  writer's  belief  that  statistical  decision  theory  can  be  of 
practical  value  to  the  naval  operations  analyst,  prompted  the  present 
thesis. The thesis is intended as an introduction to the subject, and,
except for Chapter I which is non-mathematical, is directed toward
the  reader  who  has  studied  calculus  and  has  completed  an  elementary 
course  in  probability  and  statistics.     The  thesis  purports  to  do  no  more 
than  present  the  most  essential  elements  of  statistical  decision  theory 
and  the  detailed  solution  of  a  simple  special  case.     The  reader  inter- 
ested in  a  more  mature  treatment  is  referred  to  Wald. 

Source  material  for  the  paper  has  consisted  primarily  of  Wald's 
book  and  notes  taken  by  the  author  during  a  course  of  instruction  in 
statistical  decision  functions  given  by  Professor  Thomas  E.   Oberbeck 
at  the  United  States  Naval  Postgraduate  School.     The  contents  are 
arranged in five chapters and an appendix. Chapter I is a non-technical
discussion  of  a  type  of  practical  problem  that  may  be  solved  by  statisti- 
cal decision  theory.     The  technical  treatment  begins  in  Chapter  II,   where 
the  general  formulation  of  the  Bayes  solution  of  the  statistical  decision 
problem  is  presented.     Chapter  III  introduces  certain  assumptions 
needed  to  apply  the  theory,   and  Chapter  IV  treats  an  elementary  special 
case.     Chapter  V  deals  with  the  Minimax  solution.     The   Appendix  gives 
a  review  of  some  selected  mathematical  concepts  needed  to  understand 
better  the  technical  discussions.     It  is  recommended  that  the  reader 
study  the  Appendix  before  beginning  Chapter  II. 

The thesis was written during the period January - June, 1955
at the U. S. Naval Postgraduate School, Monterey, California. I wish
to  express  my  gratitude  to  the  Navy  for  affording  me  the  opportunity  to 
write  the  thesis,  to  Professor  Thomas  E.   Oberbeck  for  the  technical 
competence  and  contagious  enthusiasm  he  brought  to  his  task  as  faculty 
advisor,   to  Professor  Walter  Jennings  for  helpful  suggestions  made 
while  serving  as  second  reader,   and  to  Mrs.    D.   P.    Slingerland  for 
painstaking  clerical  assistance. 
CHAPTER I
A  NON-TECHNICAL  DISCUSSION 

1.  Introduction.

In  naval  planning  it  is  often  necessary  to  predict  the  future  use- 
fulness of  a  proposed  weapon  or  tactic.     To  do  this,    some  value  asso- 
ciated with  the  weapon  or  tactic,    such  as  percentage  success,   average 
missed  distance,   or  average  life,   is  selected  as  a  measure  of  the  use- 
fulness of  the  weapon  or  tactic.     The  problem  then  becomes  one  of  es- 
timating what  this  value,   which  we  shall  refer  to  as  a  parameter  value, 
would  be  in  a  future  war. 

The  usual  procedure  for  doing  this  is  to  conduct  some  trials.     An 
estimate  of  what  the  parameter  value  would  be  in  a  future  war  is  obtain- 
ed as  a  result  of  these  trials.     The  important  thing  to  note  is  that  the 
estimate  is  not  guaranteed  to  be  correct.     We  intuitively  suspect  that 
the  accuracy  of  the  estimate  increases  as  the  number  of  trials  conduct- 
ed increases.     Hence,  the  number  of  trials  to  be  conducted  is  of  funda- 
mental importance. 

The question of how many trials to conduct is often decided arbitrarily.
Alternatively, if the services of a statistician are available, the
naval planner may determine the number of trials required to give, on
the average, an arbitrarily specified degree of confidence in the estimate.
In either case, some arbitrariness is retained.

Statistical  decision  theory  adds  a  refinement  to  this  procedure. 
It employs a criterion, based on probability theory, to select an optimum
number of trials. The process involves a sort of cost analysis
of the problem. In practical situations, the cost of conducting trials
will usually be significant, and a definite cost may be associated with
a poor estimate of the parameter value. To avoid the cost of the trials,
the planner is led to conduct no trials, or only a few; to avoid the cost
of a poor estimate, he is led to conduct a great number of trials. Obviously,
the two considerations are opposed. The purpose of statistical
decision  theory  is  to  reconcile  these  two  opposing  considerations,   and, 
by  the  use  of  the  criterion,   to  arrive  at  an  optimum  plan  concerning 
the  number  of  trials  to  be  conducted  and  the  final  decision  to  be  reached. 
Let  us  consider  an  example  to  see  what  this  means. 

2.  Exhibit A. [1]

Suppose  the  Navy  is  interested  in  a  newly  developed  midget  sub- 
marine to  be  launched  from  a  mother  submarine  and  used  to  kill  enemy 
submarines.     The  question  of  detection  is  not  under  consideration,  but 
merely  the  capability  of  the  midget  to  effect  kills.     It  has  been  decided 
that  the  device  should  be  tested.     Budgetary  considerations,   consider- 
ations of  priority  of  the  services  of  the  testing  agency,   etc.    dictate  the 
necessity  of  answering  the  question:    How  many  trials  are  likely  to  be 
conducted?     The  question  may  be  answered  by  using  statistical  deci- 
sion theory.     But  before  giving  the  answer,   it  is  necessary  to  establish 
some  precepts  to  be  used  in  reaching  it.     The  technical  meaning  of 
these  precepts  will  be  seen  later,   when  we  discuss  each  as  a  datum  of 
the  statistical  decision  problem.     For  the  moment,   let  us  think  of  them 
merely  as  the  ingredients  of  a  recipe.     They  must  be  put  into  the  problem 
if  we  are  to  obtain  a  solution. 

[1] The example is entirely fictitious (hence unclassified), and has been
chosen merely for illustration.


There  are  five  precepts,   and  they  are  straightforward.     First,   we 
decide  to  classify  each  trial  of  the  midget  submarine  as  a  success  or  a 
failure, according as the midget succeeds or fails to achieve a kill on
the  trial.     Then,   in  our  problem,   the  percentage  success  of  the  midget 
submarine  in  a  future  war  is  the  unknown  parameter  value  described  in 
the  Introduction.     Second,   we  must  say  something  about  the  relative 
likelihood of the various possible parameter values, i.e., the possible
values of the percentage success of the midget submarine in a future
war. Since we have no knowledge to the contrary, we assume that all
values  in  the  range  0  to  100%  are  equally  likely  to  occur.     Third,   we 
decide  to  accept  the  midget  if  it  succeeds  on  fifty  percent  or  more  of 
its  trials,   and  to  reject  it  if  it  does  not.     This  decision  might  be  based, 
for example, upon an assumption that present anti-submarine attack
methods  will  succeed  from  twenty-five  to  seventy-five  percent  of  the 
time  in  a  future  war.     Fourth,  we  assume  that  the  cost  of  each  trial 
will  be  the  same,   and  a  study  of  the  tactical  situation,  forces  involved, 
etc.   fixes  the  amount  at  $4000  per  trial.     Fifth,   we  have  to  establish 
the  cost  of  a  wrong  decision  as  to  whether  the  midget  is  superior  to 
present  anti-submarine  attack  methods.     We  may  do  this  by  making 
a  careful  study.     The  study  might  consider  such  things  as  the  cost  of 
producing  the  midget,   the  number  that  would  be  produced,   the  cost 
of alternative weapons, etc., and it leads us to an estimate of the
cost of a wrong decision as shown in the following table:

Decision                            Percentage success of     Cost in
                                    midget in future war      Dollars
----------------------------------------------------------------------
Accept midget (trial successes      more than 25%                   0.
greater than 50%)

Accept midget (trial successes      less than 25%           1,000,000.
greater than 50%)

Reject midget (trial successes      more than 75%           1,000,000.
less than 50%)

Reject midget (trial successes      less than 75%                   0.
less than 50%)

Cost of Decision
Table 1.
Note  that  the  first  and  last  lines  of  the  table  represent  correct  deci- 
sions and  cost  nothing,  whereas  the  second  and  third  lines  represent 
wrong  decisions  and  cost  a  definite  amount. 

Let  us  elaborate  upon  the  nature  of  these  "costs"  of  wrong  deci- 
sion.    They  are  not  actual  amounts  of  money  that  must  be  paid  to 
someone. Rather, they may be explained as follows: If a wrong decision
is made, a certain disadvantage accrues to the Navy as a result.
The  money  evaluation  of  this  disadvantage  is  called  the  cost  of  wrong 
decision.     This  is  much  like  the  "cost"  to  a  salesman  who  loses  a 
$300  commission  because  he  elects  to  play  golf  instead  of  seeing  a 
prospective  customer.     He  doesn't  have  to  pay  anyone  the  $300,  but 
he  is  nonetheless  $300  worse  off  than  he  might  have  been.     We  say 
he has made a "costly decision". In short, the cost of wrong decision
is the money equivalent of the loss suffered as a result of the wrong
decision.

In considering the costs shown in Table 1, it is necessary to keep
in mind the distinction between the percentage success observed in the
trials and the percentage success the midget submarine would have in
a future war. The first is an estimate of the second, and is not necessarily
correct. Note that, if the midget is definitely superior to present
anti-submarine attack methods (i.e., would have a percentage
success in a future war in excess of 75%) and we reject it, we must
penalize  ourselves  $1,000,000.     This  is  the  cost  shown  in  the  third 
line  of  the  table.     Similarly,   if  the  midget  is  definitely  inferior  to 
present  anti-submarine  attack  methods  (would  have  a  percentage 
success  in  a  future  war  of  less  than  25%)  and  we  accept  it,  we  must 
again  penalize  ourselves  $1,000,000.     This  is  the  cost  shown  in  the 
second  line  of  the  table.     On  the  other  hand,   if  the  percentage  success 
of  the  midget  in  a  future  war  is  between  25%  and  75%,   we  do  not  need 
to  penalize  ourselves  for  either  decision.     This  is  not  unreasonable, 
in  view  of  our  earlier  assumption  that  present  anti-submarine  attack 
methods  have  a  percentage  success  of  25%  to  75%.     For,   if  the  mid- 
get would  have  a  percentage  success  in  the  same  range,  we  shall 
consider  that  we  have  really  neither  gained  nor  lost  by  either  accept- 
ing or  rejecting  it. 

The solution to the problem can now be given. It takes the form
of a table (Table 2). The question of how such a table is obtained is,
essentially, the subject of this paper. The actual detailed procedure
for obtaining this particular table is presented in Chapter IV. For
the moment, let us accept the table. We can then examine and interpret
it, so that we may gain an appreciation of the role of statistical
decision theory.

(See following page for Table 2.)

Let us note the construction of the Table. The numbers identifying
the rows and columns designate, respectively, the number of
failures and successes of the midget submarine that have been observed
in successive trials. Any set of one row designator and one column
designator locates a square in the table. This square then applies
to the situation existing after the indicated number of failures and
successes have been observed. For example, the square in row number
three and column number five applies after three failures and five
successes have been observed. Each square contains two numbers
(of dollars). They have the following meanings:

Upper number: the anticipated cost (in dollars) to the Navy
if no further trials are conducted, and a
decision is made to accept or reject the
midget on the basis of the trials conducted
thus far.

Lower number: the anticipated cost (in dollars) to the Navy
if trials are continued, and a final decision
to accept or reject the midget is based on
the results of further trials.

The choice of the words "anticipated cost" in defining these
two numbers has been carefully made. This is because the dollar
values represented by these numbers in the table are "expected
values" in the sense of probability theory. This is discussed in
the Appendix. What the reader must understand at this point is
that the numbers are not absolute like the $4000 cost per trial and the
$1,000,000 cost of wrong decision. Rather, they are values calculated
on the basis of the likelihoods of occurrence of the possible outcomes,
much like insurance companies calculate the life expectancy of man
from the relative frequency of deaths at each age.

The  anticipated  costs  shown  in  Table  2  constitute  the  criterion 
used  to  determine  an  optimum  solution  to  the  problem.     Hence,  the 
solution  is  optimum  relative  to  these  costs  as  a  criterion.     Since 
the  nature  of  the  criterion  is  probabilistic,  the  final  decision,   also,  is 
probabilistic.     What  this  means,   in  practical  terms,  is  that,   if  a  rare 
and unlikely series of results is obtained on the conducted trials, such
as  success  on  every  trial  when  actual  future  wartime  employment  will 
yield  a  preponderance  of  failures,   a  poor  decision  will  be  made.     This 
is  a  chance  that  must  be  taken  to  avoid  the  great  cost  that  would  cer- 
tainly occur  if  a  very  large  number  of  trials  were  conducted.     It  does 
not  invalidate  the  theory  any  more  than  the  survival  of  one  individual 
to  age  106  invalidates  the  methods  of  insurance  companies. 

We  may  now  proceed  with  the  interpretation  of  the  table. 
Notice  that  the  upper  number  is  greater  than  the  lower  number  in  some 
of  the  squares,   and  equal  to  it  in  others.     Those  in  which  it  is  greater 
are  enclosed  within  the  double  lines.     At  any  stage  of  testing,   corres- 
ponding to  one  of  these  enclosed  squares,  the  anticipated  cost  is  less 
to  continue  taking  trials  than  it  is  to  reach  a  decision  at  that  time.     On 
the other hand, at any stage of testing corresponding to a square outside
the double lines, the anticipated cost is no greater if trials are
halted and a decision to accept or reject the midget submarine is made
on the basis of the sample already taken.

Now  observe  that  the  (0,0)  position  is  within  the  double  lines. 
This  means  that  the  initial  anticipated  cost  is  least  if  some  trials  are 
conducted.     The  number  that  will  be  conducted  depends  on  the  outcome 
of  the  trials.     We  begin  in  the  (0,0)  position,   and  conduct  a  trial.     If 
it  succeeds,  we  move  right  to  the  (0,1)  position;  if  it  fails,  we  move 
down  to  the  (1,0)  position.     In  either  case,  the  second  position  is  still 
within  the  double  lines,   so  another  trial  is  conducted.     This  process 
is  continued  until  a  position  outside  the  double  lines  is  reached.     This 
may  require  anywhere  from  three  to  13  trials.     For  example,   if  the 
first  three  trials  all  succeed,  position  (0,3)  will  be  reached.     Here, 
the upper entry ceases to be greater than the lower entry, so it will
pay to stop taking trials and, since the percentage success of the
trials conducted (100%) is greater than 50%, to accept the midget.
As another example, suppose the trial outcomes alternate from success
to failure to success, etc., in that order. This will result in
a stair stepping down the table, returning to the main diagonal
(number of successes equals number of failures) on alternate trials.
Eventually we must arrive outside the double lines in position (6,7)
after 13 trials. The percentage success for the trials conducted is
then

$\frac{7}{13} \times 100 = 53.8\%$,

and again the decision is made to accept the midget. If the sequence
of outcomes leads to a position outside the double lines on the
upper side of the main diagonal, the percentage success of the trials
conducted will be greater than 50%, and the midget will be accepted; if
it leads to a position outside the double lines on the lower side of the main
diagonal, the percentage success of the trials conducted will be less than
50%, and the midget will be rejected.

With the aid of Table 2, it is now possible to answer the
earlier question of how many trials are likely to be conducted. The
answer consists of Table 2 and the following rule:

Begin conducting trials, and, following each trial,
note the position reached in the table. Continue
this until a position outside the double lines is
reached, then accept the midget if the number of
successes exceeds the number of failures. Reject
the midget if the number of failures exceeds the
number of successes. The minimum number of
trials required to reach a final decision will be
three; the maximum number will be 13.
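
The way the two numbers in each square drive the rule can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the procedure of Chapter IV: it takes the uniform a priori assumption, the $4000 trial cost and the $1,000,000 cost of wrong decision from the precepts, treats the percentage success as a fraction between 0 and 1, and forces a decision after a hypothetical HORIZON of 40 trials so that the backward calculation has a place to start. All function names are invented for the illustration, and the figures it produces are not claimed to reproduce Table 2.

```python
from functools import lru_cache
from math import comb

TRIAL_COST = 4_000.0        # cost per trial (fourth precept)
WRONG_COST = 1_000_000.0    # cost of a wrong decision (Table 1)
HORIZON = 40                # assumption: a decision is forced after 40 trials


def prob_p_below(x, successes, failures):
    """P(P < x) after the observed trials, starting from a uniform prior.

    The posterior is Beta(successes + 1, failures + 1); for integer
    parameters its distribution function equals a binomial tail sum.
    """
    n = successes + failures + 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j)
               for j in range(successes + 1, n + 1))


def stop_cost(failures, successes):
    """Upper number: anticipated cost of the better terminal decision now."""
    accept = WRONG_COST * prob_p_below(0.25, successes, failures)
    reject = WRONG_COST * (1.0 - prob_p_below(0.75, successes, failures))
    return min(accept, reject)


@lru_cache(maxsize=None)
def best_cost(failures, successes):
    """Smaller of the cost of stopping and the cost of continuing."""
    return min(stop_cost(failures, successes),
               continue_cost(failures, successes))


def continue_cost(failures, successes):
    """Anticipated cost of one more trial followed by the best procedure."""
    if failures + successes >= HORIZON:
        return float("inf")     # no further trials allowed in this sketch
    p_success = (successes + 1) / (successes + failures + 2)
    return (TRIAL_COST
            + p_success * best_cost(failures, successes + 1)
            + (1 - p_success) * best_cost(failures + 1, successes))


def run_trials(outcomes):
    """Follow the rule along a sequence of outcomes (1 success, 0 failure).

    Stops at the first position where continuing is not cheaper than
    deciding, or when the supplied outcomes run out.
    """
    failures = successes = 0
    for outcome in outcomes:
        if continue_cost(failures, successes) >= stop_cost(failures, successes):
            break
        if outcome:
            successes += 1
        else:
            failures += 1
    if successes > failures:
        decision = "accept"
    elif failures > successes:
        decision = "reject"
    else:
        decision = "undecided"
    return failures + successes, decision
```

Under these assumptions, `run_trials([1, 1, 1, 1])` returns `(3, "accept")`: after three straight successes the cost of stopping is about $3906, which is already less than the $4000 cost of a single further trial.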

3.  Another  Aspect. 

A direct solution of the problem has been given. Let us now
consider a possible budgetary complication. Suppose that $32,000
has been allotted to conduct the testing of the midget submarine. This
is, of course, an illogical amount in the light of statistical decision
theory. The solution does not divulge exactly how much the testing
will cost. It predicts only that from three to 13 trials will be required.
At $4000 per trial, this amounts to a cost of from $12,000 to $52,000.
The dilemma can only be resolved, in the light of statistical decision
theory, by getting the allotment changed to permit the flexibility
required by the solution. Failing this, an optimal decision may be
reached, but cannot be guaranteed. If it turns out that a position
outside the double lines, such as (5,3), is reached within eight trials, an
optimum  decision  will  be  reached  in  spite  of  the  limitation.     On  the  other 
hand,   if  we  are  still  inside  the  double  lines  after  eight  trials,    such  as  in 
position  (4,4),    sufficient  data  to  indicate  an  optimum  decision  has  not  been 
collected. 

A  variation  of  the  budget  problem  is  the  case  in  which  more  than 
$52,000  is  available  for  the  testing.     In  such  a  case,  expenditure  be- 
yond $52,000  is,   according  to  statistical  decision  theory,   a  waste  of 
funds.     The  solution  will  have  indicated  the  optimum  decision  after  13 
trials,   if  not  before,   and  additional  trials  are  not  called  for  by  the 
theory. 
4.  Summary. 

Exhibit  A  has  been  studied  to  help  provide  a  conceptual  under- 
standing of  what  is  involved  in  the  type  of  solution  of  the  testing  problem 
provided  by  statistical  decision  theory.     It  should  be  remembered  that 
the precepts, i.e., the decision to classify each trial as a success or
a  failure,  the  decision  to  either  accept  or  reject  the  midget,  the  speci- 
fication of  the  cost  of  testing,   the  specification  of  the  cost  of  wrong  de- 
cision,  and  the  specification  of  the  likelihood  of  various  values  of  the 
percentage  success  in  a  future  war,   are  necessary  inputs  to  the  pro- 
blem.    Finally,  the  solution  that  is  obtained  is  optimum  relative  to 
the  anticipated  costs  as  the  criterion,   and  the  final  decision  is  proba- 
bilistic in  nature. 
CHAPTER II
GENERAL  FORMULATION  OF  THE  BAYES  SOLUTION 

1.  Basis  of  the  Problem. 

A  datum  of  any  problem  is  defined  to  be  something,   actual  or 
assumed,  that  is  used  as  a  basis  for  reckoning.     The  statistical  de- 
cision problem  has  five  of  these.     In  Exhibit  A  we  considered  them 
intuitively  as  precepts.     Let  us    now  examine  them  in  more  technical 
detail,   and  introduce  a  portion  of  the  notation  of  statistical  decision 
theory. 

a.  Stochastic Process X: A stochastic process is defined as a
countable collection of stochastic (chance) variables having a joint
cumulative probability distribution. To explore this, let us think
of a countable collection of stochastic variables

$X = \{X_i\} = \{X_1, X_2, X_3, \ldots\}$.

Let us next think of a countable set of real values, one for each
stochastic variable, i.e.,

$x = \{x_i\} = \{x_1, x_2, x_3, \ldots\}$.

By definition (see Appendix), the joint cumulative probability distribution
of all the stochastic variables in the countable collection is
the probability that $X_i \le x_i$ simultaneously for all $i$. In other
words, $F(x)$ is the probability that $X_1 \le x_1$, $X_2 \le x_2$, $X_3 \le x_3$, $\ldots$,
simultaneously.

An important special case of a stochastic process should be
mentioned. It is the case where the stochastic variables $X_1, X_2, \ldots$
are independently and identically distributed. The condition
of independence means that the joint distribution function is the product
of the individual distribution functions. In this case,

$F(x) = G_1(x_1)\, G_2(x_2)\, G_3(x_3) \cdots = \prod_i G_i(x_i)$,

where $G_i(x_i)$ is the distribution function of the $i$th stochastic variable.
The condition that the stochastic variables be identically distributed
means that the distribution of each stochastic variable has
not only the same form (such as normal or uniform), but also the
same parameter values. Thus, we might have

$G_i(x_i) = G(x_i; \mu, \sigma)$ for all $i$,

where $G(x; \mu, \sigma)$ denotes a normal distribution with mean $\mu$ and
standard deviation $\sigma$. In this case, we may write

$F(x) = F(x; \mu, \sigma) = \prod_i G(x_i; \mu, \sigma)$,

showing the dependence of $F$ upon the values of the parameters $\mu$
and $\sigma$.

The stochastic process of Exhibit A is an example of one in
which the stochastic variables, $X_i$, are assumed to be independently
and identically distributed. The outcome of each trial of the
midget submarine is considered to be a stochastic variable. Hence,
the result of the $i$th trial constitutes the stochastic variable $X_i$.
The possible particular outcomes of each trial, success or failure,
are thought of as representing particular values of the stochastic
variables. The assumption that every trial has the same chance of
succeeding as every other trial is equivalent to the assumption that
the stochastic variables are identically distributed. The two values to
which the stochastic variables are restricted (failure and success) are
denoted by 0 and 1 respectively. The common percentage success
of each stochastic variable, thought of as a parameter, is labeled $p$
(parameter value). This makes it possible to depict the stochastic
process diagrammatically by showing the distribution, $G(x; p)$, of one
of the identically distributed stochastic variables as in Figure 1.

[Figure: bar graph and step function over the values 0 and 1]

Distribution of One of the Stochastic Variables of Exhibit A

Figure 1.
In this case, we may write

$F(x) = F(x; p) = \prod_i G(x_i; p)$,

showing the dependence of $F$ upon the parameter $p$.
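
The product form for Exhibit A can be illustrated with a short Python sketch. It assumes that $p$ is expressed as a fraction between 0 and 1 rather than as a percentage, and that only a finite list of trials is considered; the function names are hypothetical and chosen only for this illustration.

```python
def bernoulli_cdf(x, p):
    """G(x; p): probability that one trial outcome X_i is <= x.

    Outcomes are 0 (failure, probability 1 - p) and 1 (success,
    probability p), so G is a step function with jumps at 0 and 1.
    """
    if x < 0:
        return 0.0
    if x < 1:
        return 1.0 - p
    return 1.0


def joint_cdf(xs, p):
    """F(x; p) for a finite list of trials: the product of G(x_i; p)."""
    result = 1.0
    for x in xs:
        result *= bernoulli_cdf(x, p)
    return result
```

For example, `joint_cdf([0, 0, 1], 0.5)` multiplies 0.5, 0.5 and 1.0, giving 0.25: the probability that the first two trials fail while the third may take either value.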

b.  Space $\Omega$: The space $\Omega$ is defined to be a class of joint cumulative
probability distribution functions known to contain the true distribution,
$F(x)$, of the stochastic process. The elements of $\Omega$
are joint cumulative probability distribution functions and differ from
one another only in the values of their parameters. Hence, $F(x; p_1)$,
$F(x; p_2)$, $\ldots$, $F(x; p_n)$, $\ldots$ are elements of the space $\Omega$
when $F$ depends on a single parameter. For this reason, it is often
convenient, as well as illuminating, to think of $\Omega$ as a parameter space.
Adopting this view in subsequent portions of this paper, we shall refer
to elements of the space $\Omega$ as values of this parameter. The parameter
is then regarded as a stochastic variable, $P$, and, as such,
is liable to take on different values with different likelihoods. Note
that, following convention, we denote the parameter in its role as a
stochastic variable by using the capital letter $P$, while parameter
values are denoted by the small $p$. In short, $\Omega$ is a class of similar
joint cumulative probability distribution functions having different
parameter values and known to contain, as an element, the particular
joint cumulative probability distribution function having the correct
parameter value, or, as we have referred to it above, the true $F$.
To determine an optimum way of estimating this parameter value is
the crux of the statistical decision problem.

In Exhibit A, the percentage success of the midget submarine
in a future war is the particular parameter value of interest. If
we knew it, there would be no problem. Since we do not, we regard
the unknown parameter as a stochastic variable, $P$. We know only
that this stochastic variable is confined to range between 0 and
100%. It was stated, as a precept, that prior to any experimentation
we would assume the true parameter value to be anywhere in
this range with equal likelihood. This is equivalent to saying that
the stochastic variable, $P$, is continuous and that its a priori distribution
is uniform. The uniform probability density function of
$P$ and the associated cumulative probability distribution
function, $\xi(p)$, are shown in Figure 2.


[Figure: a priori probability density function and cumulative probability distribution function of $P$]

A Priori Distribution of the Parameter of Exhibit A

Figure 2.
Note that $p$ represents a possible value of the stochastic variable
$P$, and $\xi(p)$ represents the probability that $P \le p$.
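
As a small illustration, the uniform a priori distribution might be written in Python as follows. The sketch assumes the parameter is measured as a fraction between 0 and 1 instead of a percentage, and the function names are hypothetical.

```python
def prior_density(p):
    """Uniform a priori density: equal likelihood everywhere on [0, 1]."""
    return 1.0 if 0.0 <= p <= 1.0 else 0.0


def prior_cdf(p):
    """A priori probability that P <= p under the uniform assumption."""
    return min(max(p, 0.0), 1.0)


def prior_probability_between(low, high):
    """A priori probability that P falls in the interval (low, high]."""
    return prior_cdf(high) - prior_cdf(low)
```

Before any trials are conducted, `prior_probability_between(0.25, 0.75)` gives 0.5, so under this assumption the midget is as likely as not to fall in the range where neither decision is penalized.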

c.  Space    D    :      The  space    D      is  defined  to  be  the  space  of  possi- 

ble final  decisions.     To  illustrate    D      ,   let  us  again  refer  to  Exhibit 
A.     We  recall  that,   at  any  stage  of  experimentation,  we  were  always 
faced  with  two  alternative  types  of  decisions,   namely,  to  make  a 
final  decision  or  to  continue  experimenting.     These  two  types  of 
decisions  are  distinguished  by  defining  two  classes  of   decisions: 

D    :      the  class  of  all  terminal  decisions 

D     :     the  class  of  all  decisions  to  continue  experimenting, 
such  as  take  one  more  trial  or  take  two  more  stages 
of  three  trials  each,   etc. 
Now,  in  Exhibit  A,     D      consisted  of  two  elements: 


d.   :      accept  the  midget. 
dj  :      reject  the  midget. 
D      consisted  of  a  single  element: 
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d.    :     take  one  more  trial. 

In general, $D^t$ and $D^e$ are not so restricted, but may consist of as
many elements as needed to cover all possible decisions. This idea
is expressed symbolically by:

$D^t$ is a class consisting of $d_1^t, d_2^t, \ldots$

$D^e$ is a class consisting of $d_1^e, d_2^e, \ldots$

To illustrate the relation between $D^t$ and $D^e$, it is convenient to
define the class D as the class of all possible decisions. It is then
clear that $D = D^t \cup D^e$. This is shown pictorially in Figure 3.


Decision Space
Figure 3.
It will be recognized that the decisions in $D^t$ and $D^e$ together are
exhaustive, and that the two classes are mutually exclusive.

d. Weight Function $W(p, d^t)$: The weight function is defined to
be a non-negative function, the value of which expresses the cost of
making the terminal decision $d^t$ when the true parameter value is
p. It is through the weight function that the cost of making a wrong
decision is introduced into the problem. If a correct decision is
made, the value of $W(p, d^t)$ will be zero; if an incorrect decision
is made, $W(p, d^t)$ may have a positive value. In general, the
weight function, like any function of two variables, may be depicted as
a surface as shown in Figure 4.


Weight Function (General)
Figure 4.
In the special case of Exhibit A, this surface degenerates into two
curves in space as follows:

Weight Function for Exhibit A
Figure 5.
Since there are only two elements in $D^t$ for Exhibit A, it becomes more
instructive to represent Figure 5 as two curves as shown in Figure 6.

Alternative Representation of Weight Function for Exhibit A
Figure 6.
Either  Figure  5  or  Figure  6  is  the  equivalent  of  Table  1. 

The weight function is the most difficult datum of the statistical
decision problem to specify. Since it is a datum, it must be
known before a statistical decision problem can be solved. The
operations analyst must be able to specify its value for any values
of the arguments p and $d^t$. This amounts to saying that he must
be able to assign a numerical cost to any combination of a possible
terminal decision and a possible parameter value. The question of
how to do this is one that needs extensive investigation, and offers
an opportunity for further study.

It is often possible and desirable to classify decisions as
merely right or wrong. In such a case, the weight function, $W(p, d^t)$,
takes on only the values 1 and 0, and is said to be a simple weight
function. Except for a scaling factor that is a power of 10, this was
the case in Exhibit A.

e. Cost Function $C(x, s)$: The cost function is defined to be a
non-negative function expressing the cost of experimentation. In general,
it depends on the values, $x = x_1, x_2, \ldots$, obtained on the
observations. It also depends on the variables observed in each stage of
experimentation, and the number of stages, $s = s_1, s_2, \ldots, s_k$,
observed. However, it may be possible, and is usually desirable,
to consider the special case in which the cost of experimentation is
the same for each experiment. Then the total cost of experimentation
is proportional to the number of trials conducted. This was the
case in Exhibit A where each observation cost $4000, and the cost
function had a value of 4000 times the number of observations taken.
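
The two simplest data of the problem, a simple weight function and a simple cost function, can be sketched in a few lines. This is only an illustration: the $4000 cost per observation comes from Exhibit A, but the threshold `p0`, the decision labels, and the function names are assumptions introduced here and are not taken from Table 1.

```python
COST_PER_OBSERVATION = 4000  # dollars per trial, as stated for Exhibit A


def simple_weight(p, decision, p0=0.5):
    """Simple weight function: 1 for a wrong terminal decision, 0 otherwise.

    ASSUMPTION: "accept" is treated as correct when p >= p0 and "reject"
    as correct when p < p0. The threshold p0 = 0.5 is hypothetical.
    """
    correct = "accept" if p >= p0 else "reject"
    return 0 if decision == correct else 1


def simple_cost(num_observations, cost_per_observation=COST_PER_OBSERVATION):
    """Simple cost function: proportional to the number of trials taken."""
    return cost_per_observation * num_observations
```

With these definitions, `simple_cost(3)` is the cost of three trials, and `simple_weight(0.8, "reject")` charges one unit for rejecting when the assumed threshold says acceptance was correct.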

2. The Statistical Decision Function, $\delta(x, s)$.

A statistical decision function, $\delta$, is a set of rules which
estimates a parameter using the results of observations of a
stochastic process X. It depends on the values $x = x_1, x_2, \ldots$
obtained on the observations and on the variables observed in each
stage of experimentation as well as the number of stages,
$s = s_1, s_2, \ldots, s_k$. $\delta$ is a function which prescribes a
plan for conducting experimentation and reaching a terminal decision.
For example, in Exhibit A the statistical decision function consisted
of Table 2, from which instructions for experimenting and reaching a
terminal decision were obtained. The problem of statistical decision
theory is, given the stochastic process X, the space $\Omega$, the space
$D^t$, the weight function $W(p, d^t)$ and the cost function $C(x, s)$,
to find the statistical decision function that provides the optimum
decision.

3. The Risk Function, $r(p, \delta)$.

Each statistical decision function $\delta$ is an element of the class
$\mathfrak{D}$ of all statistical decision functions. To select that
$\delta$ from $\mathfrak{D}$ which provides the optimum solution to a
statistical decision problem, a criterion is needed. That is the role
of the risk function. We have already seen, from Exhibit A, that the
criterion must take account of the conflicting costs of experimentation
and wrong decision. To introduce these costs more precisely into the
risk function, let us define

$r_1(p, \delta)$: the expected cost of decision [expected value of
$W(p, d^t)$] when p is true and $\delta$ is used.

$r_2(p, \delta)$: the expected cost of experimentation [expected
value of $C(x, s)$] when p is true and $\delta$ is used.

Note that $r_1(p, \delta)$ and $r_2(p, \delta)$ are both expected values.
The meaning of "expected value" has been discussed briefly in connection
with the anticipated costs of Exhibit A; it is explained more
technically in the Appendix. Now, $r_1(p, \delta)$ and $r_2(p, \delta)$
are, respectively, the expected (average) values of $W(p, d^t)$ and
$C(x, s)$ for given values of p and $\delta$. That is, $W(p, d^t)$ is
averaged by the probability that $d^t$ will be made to give
$r_1(p, \delta)$, and $C(x, s)$ is averaged by the probability that the
values x will be obtained when the stages, $s = s_1, s_2, \ldots, s_k$,
are observed to give $r_2(p, \delta)$. The notion that these averages
are obtained for a particular $(p, \delta)$ should be kept clearly in
mind, for we shall subsequently require expected values calculated with
respect to the variables p and $\delta$.

The risk function may now be defined to be the sum of the expected values
of the weight function and the cost function for given values of p and
$\delta$. That is,

$$r(p, \delta) = r_1(p, \delta) + r_2(p, \delta).$$

Hence, the risk function, which may take on a value for any pair of
arguments $(p, \delta)$, represents the total expected cost associated
with these arguments.
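
To make the definition concrete, the risk can be computed for one very simple decision function. The sketch below assumes two-valued observations with success probability p (as in Exhibit A) and a hypothetical fixed-sample rule: take exactly `n` trials and accept if at least `k` are successes. The rule, the threshold `p0`, the unit loss, and the cost per trial are all illustrative assumptions, not values from the thesis.

```python
from math import comb


def weight(p, decision, p0=0.5, loss=1.0):
    """Hypothetical simple weight function W(p, d^t).

    ASSUMPTION: accepting is wrong when p < p0, rejecting is wrong
    when p >= p0; a wrong decision costs `loss`.
    """
    wrong = (decision == "accept" and p < p0) or (
        decision == "reject" and p >= p0
    )
    return loss if wrong else 0.0


def risk(p, n, k, cost_per_trial=0.01):
    """r(p, delta) = r1 + r2 for the rule 'take n trials, accept if
    the number of successes is at least k'."""
    # r1: expected cost of decision, averaging W over the binomial
    # distribution of the number of successes.
    r1 = 0.0
    for successes in range(n + 1):
        prob = comb(n, successes) * p**successes * (1 - p) ** (n - successes)
        decision = "accept" if successes >= k else "reject"
        r1 += prob * weight(p, decision)
    # r2: expected cost of experimentation; n is fixed, so no averaging
    # is needed for this particular rule.
    r2 = cost_per_trial * n
    return r1 + r2
```

For a sequential rule the number of trials is itself random, so `r2` would also require an average; the fixed-sample rule is used here only because it keeps both terms easy to see.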

4. The Bayes Solution.

The goal of statistical decision theory is to select the particular
statistical decision function, $\delta_0$, that prescribes the optimum
plan concerning the number of trials to be conducted and the optimum
terminal decision based on the results of these trials. The risk
function is the basic criterion to be used in making this selection.
But the risk function, as we have seen, depends on both p and $\delta$
for its value. The dependence on p makes it unsuitable, in its present
form, as a yardstick for comparing the relative merits of various
$\delta$. To overcome this difficulty, we need to remove the dependency
on p. This is accomplished by averaging out the p, leaving a new
function, the average risk, which depends on $\delta$ alone for its
value. The values of the new function may be ordered as to magnitude,
and the magnitudes will vary with $\delta$ alone.

Let us elaborate on this. It often happens that a reasonable estimate
of the likelihood of P taking on various values, p, can be given at the
outset. That is, the physics of the problem, a study of past results, or
even a shrewd analysis may provide an a priori distribution of P. This
simply means that we are able to specify, either as an assumption or
as a reasonable approximation, some distribution function, $\xi(p)$, that
describes the likelihood with which P will take on the values within its
range of possible values. From this point on, $\xi$ is assumed to be
known.² If we now take the expected value of the risk function with
respect to the a priori distribution of P, we get, in Wald's notation,

$$r(\xi, \delta) = \int_{\Omega} r(p, \delta)\, d\xi(p).$$

Notice that this average risk, averaged with respect to the a priori
knowledge of P, depends only on $\xi$ and $\delta$, and $\xi$ is known.
This is a significant result. It means that the average risk is suitable
as the yardstick for comparing the various $\delta$, since it can be
ordered as to magnitude, and the magnitudes depend only on $\delta$. Our
interest, of course, is in selecting a particular $\delta_0$ that makes
the average risk the least. That is, we want a $\delta_0$ such that

$$r(\xi, \delta_0) = \min_{\delta} r(\xi, \delta).$$

This is often alternatively expressed as

$$r(\xi, \delta_0) \le r(\xi, \delta) \quad \text{for all } \delta.$$

Such a $\delta_0$ constitutes a Bayes solution. Thus, a Bayes solution is
a $\delta_0$ which minimizes the average risk, $r(\xi, \delta)$, with
respect to all $\delta$. It is to be noted that a Bayes solution is a
solution relative to a particular a priori distribution $\xi$. The
procedure employed to arrive at a Bayes solution is summarized in the
block diagram of Figure 7.
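
The averaging and minimizing steps can be illustrated numerically. The sketch below assumes a uniform a priori distribution (as in Exhibit A), two-valued observations, and a hypothetical simple weight function with threshold `p0`. It searches only a small family of fixed-sample rules ("take n trials, accept if at least k successes"), so it finds the best rule within that family, not a true Bayes solution over all decision functions. All names and numerical values are illustrative assumptions.

```python
from math import comb

# Midpoints of 1000 equal cells on [0, 1]; averaging over them
# approximates integration against a uniform a priori distribution.
GRID = [(i + 0.5) / 1000 for i in range(1000)]


def weight(p, decision, p0=0.5):
    """Hypothetical simple weight function (1 if wrong, 0 if right)."""
    correct = "accept" if p >= p0 else "reject"
    return 0.0 if decision == correct else 1.0


def risk(p, n, k, c):
    """r(p, delta) for the rule: take n trials, accept if successes >= k."""
    r1 = 0.0
    for s in range(n + 1):
        prob = comb(n, s) * p**s * (1 - p) ** (n - s)
        r1 += prob * weight(p, "accept" if s >= k else "reject")
    return r1 + c * n


def average_risk(n, k, c):
    """r(xi, delta): the risk averaged over the uniform prior."""
    return sum(risk(p, n, k, c) for p in GRID) / len(GRID)


def best_fixed_sample_rule(max_n, c):
    """Return the (n, k) pair with least average risk in the family."""
    candidates = [(n, k) for n in range(max_n + 1) for k in range(n + 2)]
    return min(candidates, key=lambda nk: average_risk(nk[0], nk[1], c))
```

Calling `best_fixed_sample_rule(5, 0.01)` orders the candidate rules by a single number each, which is exactly what the dependence on p prevented before averaging.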


² The case where $\xi$ cannot be specified is discussed in Chapter V.

CHAPTER III
ASSUMPTIONS OF STATISTICAL DECISION THEORY

1. An Assumption Concerning Each Datum.

This chapter introduces some assumptions applied to the theory
of statistical decision functions to ensure that solutions exist. A
complete study of the implications of these assumptions is not
attempted. Rather, the assumptions are briefly presented here merely
to acquaint the reader with the nature of the problem, so that he may
gain some insight into the character of the restrictions imposed by
the assumptions. A full treatment is given by Wald. One assumption
regarding each datum of the statistical decision problem is required.

a. Assumption 1: The assumption regarding the stochastic process
X is stated only for the case where the $X_i$ are independently and
identically distributed. In this case, it is assumed that the
stochastic process, X, is discrete or absolutely continuous. That is,
either each component stochastic variable is discrete, or it is
continuous and has a density function. Continuous stochastic variables
without density functions are not admitted.

b. Assumption 2: A convergence property regarding the space $\Omega$
is required. However, it is not necessary to explore the nature of
this property for our purposes, since Wald shows that it is a
consequence of Assumption 1. As such, it constitutes no additional
limitation.

c. Assumption 3: The weight function, $W(p, d^t)$, is a bounded
function of p and $d^t$. Recalling that the weight function was defined
to be a non-negative function which describes the cost of making any
particular terminal decision, $d^t$, we see that this assumption merely
excludes the possibility of any decision costing an infinite amount.

d. Assumption 4: The space $D^t$ is compact in the sense of the
metric

$$R(d_1^t, d_2^t) = \sup_{p}\, \left| W(p, d_1^t) - W(p, d_2^t) \right|.$$

This assumption is fulfilled if the space $D^t$ is finite. That is, if the
number of terminal decisions which may be made is finite, the
assumption is satisfied. This will cover most cases. However, if $D^t$
is not finite, the assumption can generally be satisfied by restricting
the range of the unknown parameter to a bounded space. This
restriction appears to present no practical difficulty.

e. Assumption 5: The cost function, $C(x, s)$, satisfies the
following three conditions:

(1) $C(x, s) \ge 0$ for all x and s, and
$C(x; s_1, \ldots, s_k, s_{k+1}) \ge C(x; s_1, \ldots, s_k)$.

(2) For any given s, the cost, $C(x, s)$, is either a
bounded function of x or $C(x, s) = \infty$ identically in x.

(3) There exists a sequence, $\{c_m\}$ (m = 1, 2, ..., ad inf.),
of positive values such that

$$\lim_{m \to \infty} c_m = \infty$$

and

$$C(x, s) \ge c_m \quad \text{for all } x \text{ and for all } s = [s_1, \ldots, s_k]$$

for which the set theoretical sum of $s_1, \ldots, s_k$ contains
at least m elements.

The meaning of this assumption concerning the cost of experimentation
is given in words as follows:

(1) The cost of experimentation cannot be negative, and the total
cost of experimentation after an additional stage is taken
cannot be less than it was before.

(2) The cost of experimentation is either finite or it is impossible
to make observations of certain variables.

(3) Regardless of the values of the observations made or the
number of stages employed in making them, if the total
number of observations is at least m, then the cost,
$C(x, s)$, of these observations is not less than the m-th
term in some increasing sequence, $c_m$, which approaches
infinity as a limit. The basic idea of this is that there
exists some minimum value of the cost of observing m
variables beyond which it is impossible to reduce the cost
of observing m variables by rearranging the composition
of the stages of experimentation. In other words,
it is not possible to observe more variables for less
money by taking the stages wholesale.
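
As a quick check of these conditions, consider the simple cost of Exhibit A, in which every observation costs the same amount. The following is a sketch only; the symbols n (total number of observations), n' (observations in an added stage), and the choice $c_m = c\,m$ are notation introduced here for illustration.

```latex
% Simple cost: each of the n observations costs c > 0.
C(x, s) = c\,n, \qquad c > 0.

% (1) Non-negative, and non-decreasing when a stage of n' >= 0
%     further observations is added:
c\,n \ge 0, \qquad c\,(n + n') \ge c\,n.

% (2) For a given s the cost does not depend on x at all,
%     so it is a bounded function of x.

% (3) Take c_m = c m. Then
\lim_{m \to \infty} c_m = \infty,
\qquad
C(x, s) = c\,n \ge c\,m = c_m \quad \text{whenever } n \ge m.
```

Under these assumptions the simple cost function satisfies Assumption 5 without further restriction.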

2. An Assumption Concerning the Space $\mathfrak{D}$.

An assumption concerning the space $\mathfrak{D}$ of admissible decision
functions is made in addition to the assumptions concerning each
datum. The most essential portion of the assumption is that only those
decision functions which prescribe a finite amount of experimentation
and which lead to a terminal decision are to be considered.

3. Some Consequences of the Assumptions.

Regardless of how slight the cost of experimentation, if one
experimented an infinite amount the cost would increase without bound.
Therefore, there exists a point beyond which further experimentation
is not profitable. This intuitive notion is developed rigorously by Wald
when he shows that, even though we limit ourselves to decision
functions which prescribe a finite amount of experimentation, we can
still approach an optimum solution arbitrarily closely under the
assumptions of this chapter.

Subject to the assumptions of this chapter, a Bayes solution
exists for any given a priori distribution, $\xi$. If it is not
practicable to specify an a priori distribution, then the decision
problem may be viewed as a zero-sum, two-person game in the sense of
von Neumann's theory of games, and a minimax solution exists. A minimax
solution is a Bayes solution relative to the least favorable a priori
distribution. The minimax solution is discussed further in Chapter V.

CHAPTER IV
THE BAYES SOLUTION FOR A SPECIAL CASE

1.  General. 

The  general  formulation  of  the  Bayes  solution  to  the  statistical 
decision  problem  was  given  in  Chapter  II,   and  some  of  the  theory  un- 
derlying its  development  was  pointed  out  in  Chapter  III.     In  this  chap- 
ter, we  shall  undertake  a  progressive  restriction  of  the  general  problem 
until,  ultimately,  we  arrive  at  the  special  case  illustrated  by  Exhibit  A. 
Thereupon,  the  detailed  solution  of  Exhibit  A  will  be  indicated.     The 
first  step  in  this  process  will  be  to  consider  a  statistical  decision  pro- 
blem in  which  the  stochastic  variables  are  restricted  to  be  indepen- 
dently and  identically  distributed,  and  the  cost  of  experimentation  to 
be  proportional  to  the  number  of  observations.     Then  we  shall  proceed 
to  the  case  where  the  stochastic  variables  are  further  restricted  to 
take  only  two  values.     The  discussion  of  the  latter  will  terminate  with 
the  solution  of  Exhibit  A. 

2.  Independently  and  Identically  Distributed  Stochastic  Variables 
with  Simple  Cost. 

Recalling that the object of statistical decision theory is to
find the "best" decision function, we may readily see how the
restrictions we are imposing will help us. By restricting the cost
function to be simple, i.e., by requiring the cost of experimentation
to be proportional to the number of observations, we make it possible
to ignore the manner in which the observations are grouped or arranged.
That is, we may consider only those decision functions for which
each stage of experimentation consists of exactly one observation.
Further, by requiring the stochastic variables $X_i$ to be independently
and identically distributed, we eliminate the need for concern as to
which particular stochastic variables are observed. As a consequence, we
may limit the decision functions considered to those which not only
prescribe a single observation per stage, but also prescribe that the
stochastic variables will be observed in order. This is possible
because the stochastic variables, being identically distributed, may be
ordered in any desired way.

In continuing our search for a "best" decision function, we may
now assert that, in choosing it, we need only compare the merits of
decision functions falling into the limited category explained in the
preceding paragraph. And since we are seeking a Bayes solution,
the decision function we ultimately select will be the one that is
"best" in the sense of the Bayes solution of Chapter II. The reader
will recall that the Bayes solution is given relative to an a priori
distribution $\xi(p)$ in $\Omega$, and that it consists of that decision
function, $\delta_0$, which minimizes the average risk, the average being
taken with respect to $\xi$ and the minimum over all $\delta$. With these
facts in mind, we may proceed with the process of comparing the
average risk produced by each $\delta$, and the choice of the $\delta_0$
which produces the least average risk.

Let m be a non-negative integer, and let $\delta^m$ denote a decision
function which guarantees that the total number of observations
will not exceed m. Then, for any a priori distribution $\xi$, we
may define

$$\rho_m(\xi) = \min_{\delta^m} r(\xi, \delta^m)$$

to be the least average risk that can be found by considering only
decision functions which guarantee no more than m observations.

Similarly,

$$\rho(\xi) = \min_{\delta} r(\xi, \delta)$$

is the least average risk to be found by considering all decision
functions, whether or not they prescribe a finite number of observations.
A particular decision function that belongs to both classes $\delta$ and
$\delta^m$, which we will be interested in, is $\delta^0$. This is the
decision function which is guaranteed to prescribe no observations. It
is of interest because it enables us to write

$$\rho_0(\xi) = \min r(\xi, \delta^0) = \min_{d^t} W(\xi, d^t).$$

This is an obvious, but important relation. It says simply that the
least average risk, if we consider only decision functions which
prescribe no experimentation, is equal to the minimum cost of decision.
This follows from the definition of risk (cost of experimentation
plus cost of decision) as given in Chapter II, and the fact that no
experimentation is involved.
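
The quantity $\rho_0(\xi)$ is simple enough to compute directly: average the weight function over the a priori distribution for each terminal decision, then take the smaller result. The sketch below does this numerically for a prior given by its density. The simple weight function, the threshold `p0`, and the function names are hypothetical choices made for illustration.

```python
def simple_weight(p, decision, p0=0.5):
    """Hypothetical simple weight function: 1 if wrong, 0 if right."""
    correct = "accept" if p >= p0 else "reject"
    return 0.0 if decision == correct else 1.0


def rho_0(prior_density, decisions=("accept", "reject"), grid_size=1000):
    """Least average risk attainable with no experimentation.

    prior_density is the density of the a priori distribution on [0, 1];
    the integral of W(p, d) against it is approximated by the midpoint
    rule on `grid_size` equal cells.
    """
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]

    def expected_weight(decision):
        total = sum(simple_weight(p, decision) * prior_density(p) for p in grid)
        return total / grid_size

    return min(expected_weight(d) for d in decisions)
```

For a uniform prior, `rho_0(lambda p: 1.0)`, either decision is wrong half the time, so nothing is gained by preferring one over the other without data.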

Two remarks at this point may assist the reader in avoiding
misunderstanding. First, whereas cumulative distribution functions
(such as $\xi$) are usually employed in logical developments, the
corresponding density functions (such as $\xi'$) are more often used in
calculation. The distinction should be constantly remembered.
Second, the present chapter requires Assumptions 1-5 of Chapter III,
but does not require the assumption concerning $\mathfrak{D}$, a fact
the reader may have surmised from the introduction of $\rho(\xi)$.

There are several theorems concerning the functions $\rho_m(\xi)$,
$\rho(\xi)$ and $\rho_0(\xi)$ which enable us to compare various average
risks and lead us to the Bayes solution. Perhaps the most important of
these is the recursion formula

$$\rho_{m+1}(\xi) = \min\left[\, \rho_0(\xi),\;\; c + \int_{-\infty}^{\infty} \rho_m(\xi_a)\, df^*(a \mid \xi) \right]. \tag{A}$$

We need to examine this formula carefully and understand it thoroughly.
It contains several symbols not given explicitly before. They are

a: stands for a value that might be obtained if a stochastic
variable were to be observed. When none is observed, but advance
calculations are made with the thought in mind that one could be,
then the symbol a may be thought of as a stochastic variable itself.

$f(a \mid p)$: a cumulative distribution function for the stochastic
variable a described above that would exist if p were the true
parameter value of the joint cumulative distribution function F(x).

$f^*(a \mid \xi)$: the expected cumulative distribution function of a
obtained by calculating the expected value of $f(a \mid p)$. That is,
$f(a \mid p)$ is weighted by the a priori knowledge, $\xi$, of the
distribution of p in $\Omega$ to obtain the average.

c: the cost of one observation.

$\xi_a$: the a posteriori cumulative distribution function of P in
$\Omega$ based upon the observation a. If $\xi$ is an a priori
distribution and a is the result of a single observation, then $\xi_a$
is an a posteriori distribution obtained by applying Bayes theorem
(Appendix A) to modify $\xi$ to $\xi_a$, the modification being based
upon the observation a.

Combining these notions, it is possible to paraphrase the recursion
formula as follows:

the least average risk produced by decision functions which prescribe
from 0 to m + 1 observations = the minimum of:

(1) the least average risk produced by decision functions which
prescribe no observation

(2) the cost of one observation plus the expected value of the least
average risk produced by decision functions which prescribe from
0 to m observations after the first one

This formula seems reasonable and its validity may be shown under
the assumptions of Chapter III. If we want to know the least average
risk to be had by allowing decision functions prescribing from 0 to
m + 1 observations, we can surely get at it by breaking the decision
functions we are allowing into two groups and picking the smaller
of the two least average risks attainable from these two groups.
If the breakdown is made into (1) decision functions prescribing no
observation and (2) decision functions prescribing from 1 to m + 1
observations, we are set up to select the minimum as indicated in
the recursion formula. The least average risk attainable from the
first group is simply $\rho_0(\xi)$, as previously defined. The least
average risk attainable from the second group is more complicated.
Since this group prescribes from 1 to m + 1 observations, we are
certain to take at least one observation. This accounts for the c
in the formula. After this one certain observation is taken, its value
being a, it is possible to modify the a priori distribution $\xi(p)$ in
$\Omega$ to an a posteriori distribution $\xi_a(p)$ in $\Omega$ by the
Bayes theorem of Appendix A. At this point we would want to proceed by
using the a posteriori distribution $\xi_a$, since it is an improvement
over the a priori distribution. To do so, we would calculate the least
average risk produced by decision functions prescribing from 0 to m more
observations; that is, $\rho_m(\xi_a)$ as previously defined. This would
give us an expression,

$$c + \rho_m(\xi_a)$$

for the least average risk attainable from our second group of
decision functions. The reasoning thus far has omitted one subtle, but
key point. It is that the single observation a is never actually
taken. Therefore, we must consider all possible values that a
might take in a future observation. To do this we must regard the
value a as a stochastic variable, and compute an expected value
of $\rho_m(\xi_a)$ with respect to the distribution of a. This accounts
for the fact that the second choice on the right side of the formula
takes the form

$$c + \int_{-\infty}^{\infty} \rho_m(\xi_a)\, df^*(a \mid \xi).$$

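Wald has shown that $\rho_m(\xi)$ will, for a sufficiently large
value of m, differ from $\rho(\xi)$ by an arbitrarily small amount.
This permits us to write

$$\lim_{m \to \infty} \rho_m(\xi) = \rho(\xi) \tag{B}$$

and leads us from formula (A) to the formula

$$\rho(\xi) = \min\left[\, \rho_0(\xi),\;\; c + \int_{-\infty}^{\infty} \rho(\xi_a)\, df^*(a \mid \xi) \right]. \tag{C}$$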

This formula is presented in the notation of the Stieltjes Integral (see
Appendix A), and does not distinguish between the case where the
stochastic variable a is discrete and the case where it is continuous.
If we desired to do so we could write

$$\rho(\xi) = \min\left[\, \rho_0(\xi),\;\; c + \sum_{a} \rho(\xi_a)\, f^{*\prime}(a \mid \xi) \right] \tag{C$_1$}$$

where $f^{*\prime}$ is the bar graph of a discrete stochastic variable, and

$$\rho(\xi) = \min\left[\, \rho_0(\xi),\;\; c + \int_{-\infty}^{\infty} \rho(\xi_a)\, f^{*\prime}(a \mid \xi)\, da \right] \tag{C$_2$}$$

where $f^{*\prime}$ is the density function of a continuous stochastic
variable.

The payoff of the preceding discussion lies in the manner in which
$\rho(\xi)$ and $\rho_0(\xi)$ may be used to obtain a Bayes solution. It
is best explained by Wald when he says:

    A Bayes solution relative to a given a priori probability
    measure³ $\xi_0$ can immediately be given in terms of the
    functions $\rho(\xi)$ and $\rho_0(\xi)$ as follows: If
    $\rho(\xi_0) = \rho_0(\xi_0)$, do not take any observation and
    make a final decision $d_0^t$ for which
    $W(\xi_0, d_0^t) = \rho_0(\xi_0)$. If $\rho(\xi_0) < \rho_0(\xi_0)$,
    take an observation on $X_1$ and compute the a posteriori
    probability measure $\xi_{x_1}$ corresponding to $\xi_0$ and $x_1$.
    If $\rho(\xi_{x_1}) = \rho_0(\xi_{x_1})$, stop experimentation and
    make a final decision $d^t$ for which
    $W(\xi_{x_1}, d^t) = \rho_0(\xi_{x_1})$. If
    $\rho(\xi_{x_1}) < \rho_0(\xi_{x_1})$, take an observation $x_2$
    on $X_2$. In general, after the observations $x_1, \ldots, x_m$
    have been made, take an additional observation if
    $\rho(\xi_{x_1, \ldots, x_m}) < \rho_0(\xi_{x_1, \ldots, x_m})$,
    and stop experimentation with a proper terminal decision if
    $\rho(\xi_{x_1, \ldots, x_m}) = \rho_0(\xi_{x_1, \ldots, x_m})$,
    where $\xi_{x_1, \ldots, x_m}$ denotes the a posteriori
    probability measure corresponding to $\xi_0, x_1, \ldots, x_m$.

3.  Stochastic  Variables  Limited  to  Two  Values. 

The case where the $X_i$ are restricted to take only two values is
quite special. It will arise when the value of each variable may be
considered to be a failure or a success, as in Exhibit A. In such cases,
the values of the stochastic variables are taken as 0 and 1. These
correspond respectively to failure and success. The following
shorthand notation is used to describe cumulative distribution functions.

$\xi$: an a priori cumulative distribution function of P in $\Omega$

$\xi_{ij}$: the a posteriori distribution of P in $\Omega$ after i 0's
and j 1's have been observed; $\xi_{00}$ is the same as $\xi$.
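
The passage from $\xi$ to $\xi_{ij}$ can be illustrated numerically. The sketch below represents a distribution of P by weights on a fine grid of parameter values and applies Bayes theorem for i 0's and j 1's. The grid representation and the function name `posterior` are assumptions made for illustration; the uniform prior is the one used in Exhibit A.

```python
def posterior(prior, grid, zeros, ones):
    """Apply Bayes theorem: reweight each grid value p by the likelihood
    p**ones * (1 - p)**zeros of the observed 0's and 1's, then renormalize."""
    unnormalized = [w * p**ones * (1 - p) ** zeros for w, p in zip(prior, grid)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


N = 1000
grid = [(i + 0.5) / N for i in range(N)]  # midpoints of N cells on [0, 1]
uniform_prior = [1.0 / N] * N             # xi_00: uniform a priori weights

# xi_12: one failure (0) and two successes (1) observed.
xi_12 = posterior(uniform_prior, grid, zeros=1, ones=2)

# Probability that the next observation is a 1 under xi_12.
p_12 = sum(w * p for w, p in zip(xi_12, grid))
```

Under a uniform prior the posterior after i 0's and j 1's is a Beta distribution with parameters (j + 1, i + 1), so the computed `p_12` should be close to (j + 1)/(i + j + 2) = 3/5. Updating one observation at a time gives the same result as updating on all of them at once.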

If there exists a positive integer $m$ such that

$$\rho_0(\xi_{mj}) \le c \quad \text{and} \quad \rho_0(\xi_{im}) \le c \quad \text{for } i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, m,$$

then it is clear from formula (C) that

$$\rho(\xi_{mj}) = \rho_0(\xi_{mj}) \quad \text{and} \quad \rho(\xi_{im}) = \rho_0(\xi_{im}) \quad \text{for } i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, m.$$

This may be explained in words as follows: Suppose an integer $m$
exists such that when either $m$ 0's or $m$ 1's have been observed,
and the attendant a posteriori distributions computed, it is found
that the least average risk attainable by allowing, from this point
on, decision functions which prescribe no experimentation does not
exceed $c$. Then from formula (C) the least average risk attainable
by allowing decision functions prescribing any amount of experimentation
is equal to that which is attainable by allowing only decision
functions which call for no further experimentation.

Footnote 3: Wald uses the term probability measure where we have been
using cumulative distribution function.

Let us now define $p_{ij}$ to be the probability of obtaining the value
1 on a single observation when $\xi_{ij}$ is the a priori distribution.
That is,

$$p_{ij} = \int_\Omega f(1 \mid p)\, d\xi_{ij}(p),$$

where $f$ is the $f$ of formula (C$_1$). Then the probability of
obtaining the value 0 on a single trial is $1 - p_{ij}$. Using this
notation, the formula (C$_1$) of the preceding section may be adapted
to the case where the stochastic variables take only two values. It
becomes

(D)  $\rho(\xi_{ij}) = \operatorname{Min}\left[\, \rho_0(\xi_{ij}),\ \ c + p_{ij}\,\rho(\xi_{i,j+1}) + (1 - p_{ij})\,\rho(\xi_{i+1,j}) \,\right].$

It is this particular form of the formula, along with the defining
relation

(E)  $\rho_0(\xi_{ij}) = \operatorname{Min}_t W(\xi_{ij}, d^t) = \operatorname{Min}_t \int_\Omega W(p, d^t)\, \xi'_{ij}(p)\, dp$

given earlier and the Bayes theorem of Appendix A that we shall use
in solving Exhibit A. The details of their use are best seen by
studying the detailed solution of the problem.

4.  The  Solution  of  Exhibit  A. 

The dollar values given in the original presentation of Exhibit A
in Chapter I may be multiplied by $10^{-6}$ without altering the
procedure followed in solving the problem. This amounts to expressing
all costs in millions of dollars. Making this simple transformation and
converting each original "precept" of the problem into a technical
datum, as subsequently introduced, we have given:

(1)  the stochastic process: $X_i = 0$ (failure) or $X_i = 1$ (success)

[Figure 8. Distribution of X: probability $1 - p$ at $x = 0$ and $p$ at $x = 1$]

(2)  the a priori distribution in the parameter space:

[Figure 9. Distribution of P: uniform density $\xi'(p) = 1$ on $0 \le p \le 1$]

(3)  the decision space: $D$ consists of two elements:

$d^1$: accept the midget submarine
$d^2$: reject the midget submarine

(4)  the weight function:

$$W(p, d^1) = \begin{cases} 0 & \text{for } p \ge \tfrac{1}{4} \\ 1 & \text{for } p < \tfrac{1}{4} \end{cases} \qquad\qquad W(p, d^2) = \begin{cases} 0 & \text{for } p \le \tfrac{3}{4} \\ 1 & \text{for } p > \tfrac{3}{4} \end{cases}$$

[Figure 10. Weight Function]

(5)  the cost function: $c = .004$, the cost of a single experiment.

The Bayes solution to this problem, that is, the $\delta_0$ that we
seek, is a table. It is the same table that was given in Chapter I.
The upper entries in the cells of the table are values of
$\rho_0(\xi_{ij})$, while the lower entries are values of
$\rho(\xi_{ij})$. Hence the table provides the comparison of
$\rho_0(\xi_{ij})$ and $\rho(\xi_{ij})$ needed to determine how to
experiment and reach a terminal decision. Our immediate task is to
calculate these values of $\rho_0(\xi_{ij})$ and $\rho(\xi_{ij})$ to
complete the table. We may begin by calculating the values of
$\rho_0(\xi_{ij})$ for successive diagonal entries $(i = j)$ from
formula (E) and Figures 9 and 10.

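$\rho_0(\xi_{00})$:

$$W(\xi_{00}, d^1) = \int_0^1 W(p, d^1)\, \xi'_{00}(p)\, dp = \int_0^{1/4} (1)(1)\, dp + \int_{1/4}^{1} (0)(1)\, dp = p\,\Big|_0^{1/4} = \tfrac{1}{4}$$

$$W(\xi_{00}, d^2) = \int_0^1 W(p, d^2)\, \xi'_{00}(p)\, dp = \int_0^{3/4} (0)(1)\, dp + \int_{3/4}^{1} (1)(1)\, dp = p\,\Big|_{3/4}^{1} = \tfrac{1}{4}$$

$$\rho_0(\xi_{00}) = \operatorname{Min}_t W(\xi_{00}, d^t) = \operatorname{Min}\left[\tfrac{1}{4},\ \tfrac{1}{4}\right] = .2500$$

The remaining diagonal entries of $\rho_0(\xi_{ii})$ are computed using
the Bayes theorem (Appendix A, Case II) as well as formula (E) and
Figures 9 and 10.

$\rho_0(\xi_{11})$:

$$\xi'_{11}(p) = \frac{(1)(1 - p)(p)}{\int_0^1 (1)(1 - p)(p)\, dp} = \frac{p - p^2}{\left[\frac{p^2}{2} - \frac{p^3}{3}\right]_0^1} = 6p - 6p^2$$

$$W(\xi_{11}, d^1) = \int_0^{1/4} (1)(6p - 6p^2)\, dp + \int_{1/4}^{1} (0)(6p - 6p^2)\, dp = \left[3p^2 - 2p^3\right]_0^{1/4} = \tfrac{5}{32}$$

$$W(\xi_{11}, d^2) = \int_0^{3/4} (0)(6p - 6p^2)\, dp + \int_{3/4}^{1} (1)(6p - 6p^2)\, dp = \left[3p^2 - 2p^3\right]_{3/4}^{1} = \tfrac{5}{32}$$

$$\rho_0(\xi_{11}) = \operatorname{Min}_t W(\xi_{11}, d^t) = \operatorname{Min}\left[\tfrac{5}{32},\ \tfrac{5}{32}\right] = .1563$$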
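
The normalization step used for the a posteriori density can be checked mechanically. The following Python sketch is an illustration added here, not part of the original computation; the function name `posterior_coeffs` is an assumption, and the uniform a priori density of Figure 9 is taken for granted. It expands $(1-p)^i p^j$ as a polynomial and divides by its integral over $0 \le p \le 1$, using exact fractions.

```python
from fractions import Fraction
from math import comb


def posterior_coeffs(i, j):
    """Coefficients a[k] of p**k in the a posteriori density after i 0's and j 1's.

    Assumes a uniform a priori density on [0, 1] (illustrative helper).
    """
    # Binomial expansion: (1 - p)**i * p**j = sum_k C(i, k) * (-1)**k * p**(j + k)
    raw = [Fraction(0)] * (i + j + 1)
    for k in range(i + 1):
        raw[j + k] = Fraction(comb(i, k) * (-1) ** k)
    # The integral of p**n over [0, 1] is 1 / (n + 1).
    total = sum(a / (n + 1) for n, a in enumerate(raw))
    return [a / total for a in raw]


# One 0 and one 1 observed: coefficients of 6p - 6p^2, lowest power first.
print(posterior_coeffs(1, 1))
```

The same helper applied to other counts of 0's and 1's gives the polynomial densities needed for the remaining cells of the table.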


The procedure may be generalized for all diagonal entries $(i = j)$
so that we have

$$\rho_0(\xi_{ii}) = \frac{\int_{3/4}^{1} (1 - p)^i\, p^i\, dp}{\int_0^1 (1 - p)^i\, p^i\, dp} = \frac{\int_0^{1/4} (1 - p)^i\, p^i\, dp}{\int_0^1 (1 - p)^i\, p^i\, dp}$$

for all $i$. Values of this last expression may be obtained from
Tables of the Incomplete Beta Function (footnote 4). The use of these
tables permits easy evaluation. Values obtained are the entries shown
in the upper halves of the diagonal cells in Table 3.

Footnote 4: Tables of the Incomplete Beta Function, Pearson, University
Press, Cambridge, 1934.

The next step is to calculate the non-diagonal $(i \ne j)$ upper entries.
This is done as follows:

$\rho_0(\xi_{23})$:

$$\xi'_{23}(p) = \frac{(1)(1 - p)^2\, p^3}{\int_0^1 (1)(1 - p)^2\, p^3\, dp} = \frac{p^3 - 2p^4 + p^5}{\left[\frac{p^4}{4} - \frac{2p^5}{5} + \frac{p^6}{6}\right]_0^1} = 60\,(p^3 - 2p^4 + p^5)$$

$$W(\xi_{23}, d^1) = \int_0^{1/4} (1)\left[60\,(p^3 - 2p^4 + p^5)\right] dp = \left[15p^4 - 24p^5 + 10p^6\right]_0^{1/4} = .0376$$

$$W(\xi_{23}, d^2) = \int_{3/4}^{1} (1)\left[60\,(p^3 - 2p^4 + p^5)\right] dp = \left[15p^4 - 24p^5 + 10p^6\right]_{3/4}^{1} = .1700$$

$$\rho_0(\xi_{23}) = \operatorname{Min}\left[.0376,\ .1700\right] = .0376$$
Again the procedure generalizes and we have, for $i < j$,

$$\rho_0(\xi_{ij}) = \frac{\int_0^{1/4} (1 - p)^i\, p^j\, dp}{\int_0^1 (1 - p)^i\, p^j\, dp}.$$

For $i > j$ we have

$$\rho_0(\xi_{ij}) = \frac{\int_{3/4}^{1} (1 - p)^i\, p^j\, dp}{\int_0^1 (1 - p)^i\, p^j\, dp}.$$

As before, the evaluation may be accomplished by use of the Tables of
the Incomplete Beta Function. Note that $\rho_0(\xi_{ij}) = \rho_0(\xi_{ji})$.
This makes it necessary to evaluate entries on only one side of the
main diagonal, since the remaining entries may be determined by
symmetry.
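
Because $i$ and $j$ are integers, the ratios of integrals for the upper entries can also be evaluated exactly, without tables, by integrating the binomial expansion term by term. The sketch below is a hedged illustration in Python, assuming the uniform a priori density and the 0-1 weight function of Exhibit A; the helper names `posterior_cdf`, `w_accept`, `w_reject`, and `rho0` are introduced here and do not come from the thesis.

```python
from fractions import Fraction
from math import comb


def posterior_cdf(i, j, x):
    """Pr(p <= x) under the a posteriori density proportional to (1-p)**i * p**j."""
    def integral(upper):
        # Term-by-term integral of sum_k C(i, k) * (-1)**k * p**(j + k) from 0 to upper.
        return sum(
            Fraction(comb(i, k) * (-1) ** k, j + k + 1) * upper ** (j + k + 1)
            for k in range(i + 1)
        )
    return integral(Fraction(x)) / integral(Fraction(1))


def w_accept(i, j):
    """W(xi_ij, d1): accepting costs 1 when p < 1/4."""
    return posterior_cdf(i, j, Fraction(1, 4))


def w_reject(i, j):
    """W(xi_ij, d2): rejecting costs 1 when p > 3/4."""
    return 1 - posterior_cdf(i, j, Fraction(3, 4))


def rho0(i, j):
    """Upper entry of cell (i, j): least risk with no further experimentation."""
    return min(w_accept(i, j), w_reject(i, j))


print(float(rho0(0, 0)), float(rho0(1, 1)), float(rho0(2, 3)))
```

Evaluated this way, $W(\xi_{23}, d^2)$ comes out as 347/2048, about .1694, slightly below the tabulated .1700; the minimum, and hence $\rho_0(\xi_{23})$, is not affected.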

With the upper entries filled in, we turn our attention to the lower
entries. They may be determined in two stages. The first stage is
just to compare $\rho_0(\xi_{ij})$ with $c$ for each square. Since

(D)  $\rho(\xi_{ij}) = \operatorname{Min}\left[\, \rho_0(\xi_{ij}),\ \ c + p_{ij}\,\rho(\xi_{i,j+1}) + (1 - p_{ij})\,\rho(\xi_{i+1,j}) \,\right]$

we may immediately select $\rho_0(\xi_{ij})$ as the value of
$\rho(\xi_{ij})$ for all squares in which $\rho_0(\xi_{ij}) \le c$. For
those squares in which $\rho_0(\xi_{ij}) > c$ we must use formula (D)
and the formula for $p_{ij}$ to calculate $\rho(\xi_{ij})$. For example,
in the case of diagonal entries where $p_{ii} = \tfrac{1}{2}$, we may
compute

$$\rho(\xi_{99}) = \operatorname{Min}\left[\, \rho_0(\xi_{99}),\ \ c + p_{99}\,\rho(\xi_{9,10}) + (1 - p_{99})\,\rho(\xi_{10,9}) \,\right]$$

$$= \operatorname{Min}\left[\, .0090,\ \ .004 + \tfrac{1}{2}(.0039) + \tfrac{1}{2}(.0039) \,\right] = \operatorname{Min}\left[\, .0090,\ .0079 \,\right] = .0079$$
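
The single step of formula (D) used in this example can be written as a small function. The following Python sketch is only an illustration; the function and argument names are assumptions made here, and the numbers passed in are those quoted in the example for the cell (9, 9).

```python
def bayes_risk_step(rho0_ij, c, p_ij, rho_right, rho_down):
    """One application of formula (D).

    rho0_ij   : risk of stopping now with the better terminal decision
    rho_right : rho(xi_{i,j+1}), the risk after a 1 is observed next
    rho_down  : rho(xi_{i+1,j}), the risk after a 0 is observed next
    """
    # Risk of paying c for one more observation and then proceeding optimally.
    continue_risk = c + p_ij * rho_right + (1 - p_ij) * rho_down
    return min(rho0_ij, continue_risk)


value = bayes_risk_step(rho0_ij=0.0090, c=0.004, p_ij=0.5,
                        rho_right=0.0039, rho_down=0.0039)
print(round(value, 4))  # 0.0079
```

Here the second branch is the smaller one, so one more observation is worth its cost at the cell (9, 9).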
In the case of non-diagonal entries, the first step is to compute
$p_{ij}$ from the formula

$$p_{ij} = \int_\Omega f(1 \mid p)\, d\xi_{ij} = \int_0^1 f(1 \mid p)\, \xi'_{ij}(p)\, dp.$$

Upon substituting in this formula we have

$$p_{ij} = \int_0^1 (p)\, \frac{(1 - p)^i\, p^j}{\int_0^1 (1 - p)^i\, p^j\, dp}\, dp = \frac{\int_0^1 (1 - p)^i\, p^{j+1}\, dp}{\int_0^1 (1 - p)^i\, p^j\, dp}.$$

This last expression may be evaluated using the Tables of the
Incomplete Beta Function. Once $p_{ij}$ is known, we have only to solve
formula (D) for $\rho(\xi_{ij})$. Values of $\rho(\xi_{ij})$ obtained in
this way complete Table 3.
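
For integer $i$ and $j$ the ratio of integrals defining $p_{ij}$ can be evaluated exactly. The Python sketch below (helper names `unit_integral` and `p_one` are assumptions made for illustration) integrates the binomial expansion term by term. It also checks the result against the closed form $(j + 1)/(i + j + 2)$, which is a standard property of the beta integral rather than something stated in the text.

```python
from fractions import Fraction
from math import comb


def unit_integral(i, j):
    """Exact integral over [0, 1] of (1 - p)**i * p**j."""
    return sum(Fraction(comb(i, k) * (-1) ** k, j + k + 1) for k in range(i + 1))


def p_one(i, j):
    """p_ij: probability that the next observation is a 1, given i 0's and j 1's."""
    return unit_integral(i, j + 1) / unit_integral(i, j)


print(p_one(9, 9), p_one(2, 3))  # 1/2 4/7
```

On the diagonal the value is 1/2, in agreement with the example for the cell (9, 9).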
The best sequence for calculating the lower entries is as follows:
Fill in the main diagonal entry in the lower right hand corner first.
Then progress to the left in that row. Next, move up to the next higher
diagonal entry and again work left on the row. The entries on the upper
right hand side of the diagonal can be filled in by symmetry.

The interpretation of the table, as given in Chapter I, may now
be stated in terms of the technical notation. Begin taking observations
and after each observation compare $\rho(\xi_{ij})$ with
$\rho_0(\xi_{ij})$. As long as $\rho(\xi_{ij})$ is less than
$\rho_0(\xi_{ij})$, continue taking observations. When an observation
is made such that $\rho(\xi_{ij}) = \rho_0(\xi_{ij})$, stop
experimentation and make a proper terminal decision. If $i > j$,
the terminal decision will be to reject the midget submarine. If
$i < j$, the terminal decision will be to accept the midget submarine.
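
The whole procedure, filling in the entries by formula (D) and then following the stopping rule, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of Table 3: the uniform a priori density, the 0-1 weight function and $c = .004$ are taken from Exhibit A, while the closed form used for $p_{ij}$, the cap `HORIZON` on the number of observations, and all function names are introduced here.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

C = 0.004      # cost of a single experiment, in millions of dollars
HORIZON = 60   # assumed cap on i + j; not a value taken from the thesis


def posterior_cdf(i, j, x):
    """Exact Pr(p <= x) under the density proportional to (1-p)**i * p**j."""
    def integral(upper):
        return sum(
            Fraction(comb(i, k) * (-1) ** k, j + k + 1) * upper ** (j + k + 1)
            for k in range(i + 1)
        )
    return integral(Fraction(x)) / integral(Fraction(1))


@lru_cache(maxsize=None)
def terminal_risks(i, j):
    """(W(xi_ij, d1), W(xi_ij, d2)): risks of accepting and of rejecting now."""
    accept = posterior_cdf(i, j, Fraction(1, 4))
    reject = 1 - posterior_cdf(i, j, Fraction(3, 4))
    return float(accept), float(reject)


def rho0(i, j):
    return min(terminal_risks(i, j))


@lru_cache(maxsize=None)
def rho(i, j):
    """Formula (D), with experimentation forced to stop at the assumed horizon."""
    stop = rho0(i, j)
    if stop <= C or i + j >= HORIZON:
        return stop
    p_ij = (j + 1) / (i + j + 2)  # closed form of the ratio of integrals for p_ij
    return min(stop, C + p_ij * rho(i, j + 1) + (1 - p_ij) * rho(i + 1, j))


def run_test(outcomes):
    """Apply the stopping rule to a sequence of 0/1 outcomes."""
    i = j = 0
    for n in range(len(outcomes) + 1):
        if rho(i, j) == rho0(i, j):  # no further experimentation pays
            accept, reject = terminal_risks(i, j)
            return ("accept" if accept <= reject else "reject"), n
        if n == len(outcomes):
            break
        if outcomes[n] == 1:
            j += 1
        else:
            i += 1
    return "continue", len(outcomes)


print(round(rho(9, 9), 4))   # close to the .0079 computed for this cell
print(run_test([1, 1, 1]))
```

Under these assumptions three successes in a row already end the test with acceptance, since $\rho_0(\xi_{03}) = 1/256$ is below $c = .004$. The function returns "continue" when the supplied outcomes run out before the rule calls for a stop.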
CHAPTER V
THE MINIMAX SOLUTION

1.  The  Minimax  Solution  and  its  Relation  to  the  Bayes  Solution. 

The scope of this paper, for detailed discussion, is limited to the
Bayes solution of the statistical decision problem. Emphasis is given
to the special case in which the $X_i$ are independently and identically
distributed, and confined to take only two values. However, to avoid
having the reader assume that this constitutes all of statistical
decision theory, mention should be made of the Minimax solution.

It was pointed out in Chapter II that the Bayes solution is
always given relative to an a priori distribution of the unknown
parameter. If such a distribution cannot be given, it may still be
possible to solve a statistical decision problem. A solution may be
obtained by viewing the decision problem as a zero sum, two person
game, and solving the game. A solution obtained in this manner is
termed a Minimax solution. A Minimax solution may also be obtained
in other ways. A Minimax solution, as noted in Chapter III, is a
particular Bayes solution. Specifically, it is that Bayes solution
which is given relative to the least favorable a priori distribution of
the unknown parameter.

The difference between the Bayes solution and the Minimax
solution lies in the choice of a yardstick for comparing the relative
merits of the various decision functions. The basic criterion, in
either case, is the risk function. But the modification of this
criterion, to arrive at the final yardstick, is different. The reader
will recall from Chapter II that, for a Bayes solution, the risk
function, $r(p, \delta)$, was modified to an expected risk,
$r(\xi, \delta)$, by averaging out the $p$, and this expected risk
constituted the final yardstick. The modification was accomplished by
using the a priori distribution, $\xi(p)$. The expected risk, which
could then be ordered as to magnitude where the magnitude depended on
$\delta$ alone, permitted the selection of the particular statistical
decision function, $\delta_0$, that provided the least expected risk
and hence the optimum solution relative to the assumed $\xi(p)$. In the
case of a Minimax solution, we consider that an a priori distribution
is not available. Hence, the procedure employed to modify the risk
function to a suitable final yardstick must be altered. The procedure
that is used consists of taking the maximum risk vice the expected
risk. An a priori distribution of $p$ is not required to do this. We
simply take the maximum value of the risk, $r(p, \delta)$, for each
$\delta$, by selecting the $p$ that maximizes it. That is,

$$\text{Max risk} = \operatorname{Max}_{p \in \Omega}\ r(p, \delta).$$

This new function, the maximum risk, is dependent on $\delta$ alone,
and can therefore be ordered as to magnitude with the magnitude
determined by $\delta$. Again we select the particular $\delta_0$ that
minimizes the yardstick. That is, we take

$$\operatorname{Min}_{\delta}\ \operatorname{Max}_{p}\ r(p, \delta),$$

the minimum being taken over all $\delta$. This is sometimes written

$$\operatorname{Max}_{p}\ r(p, \delta_0) \le \operatorname{Max}_{p}\ r(p, \delta) \quad \text{for all } \delta.$$

The $\delta_0$ of this latter expression constitutes a Minimax solution.

The statement that the Minimax solution is a Bayes solution
relative to the least favorable a priori distribution now seems
reasonable. For, if our a priori distribution for a Bayes solution were
the least favorable of all, it would lead us to the
$\operatorname{Max}_p\, r(p, \delta)$ as a yardstick.

The  procedure  used  to  arrive  at  a  Minimax  solution  is  summarized 
in  Figure  11  (see  following  page)  . 
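
The two yardsticks can be contrasted on a small finite example. In the Python sketch below the risk table, the a priori weights, and the names `delta_a`, `delta_b`, `delta_c`, `p_low`, `p_high` are all hypothetical, chosen only to show that the least maximum risk and the least expected risk may single out different decision functions.

```python
def minimax_choice(risk):
    """risk[delta][p] holds r(p, delta); return the delta with the least maximum risk."""
    return min(risk, key=lambda delta: max(risk[delta].values()))


def bayes_choice(risk, prior):
    """Return the delta with the least expected risk under the a priori weights."""
    def expected(delta):
        return sum(prior[p] * r for p, r in risk[delta].items())
    return min(risk, key=expected)


# Hypothetical risk table: two parameter values, three decision functions.
risk = {
    "delta_a": {"p_low": 0.10, "p_high": 0.60},
    "delta_b": {"p_low": 0.30, "p_high": 0.35},
    "delta_c": {"p_low": 0.50, "p_high": 0.05},
}
prior = {"p_low": 0.9, "p_high": 0.1}  # hypothetical a priori distribution

print(minimax_choice(risk))       # delta_b
print(bayes_choice(risk, prior))  # delta_a
```

Changing the a priori weights changes the Bayes choice, while the Minimax choice depends on the risk table alone.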

2.  Relation  to  the  Theory  of  Games. 

The reader familiar with von Neumann's theory of games will recognize
the procedure of the preceding section as essentially the same as
that of game theory. In fact, Wald points out the detailed
correspondence between the statistical decision problem in which no a
priori distribution is given and the zero sum, two person game. In the
general case, the corresponding game is a continuous one. This means
that the question of the strict determinateness of the game must be
investigated. Whereas the fundamental theorem of rectangular games
assures the existence of a solution to any finite game, no such
assurance exists in the case of all infinite games. However, Wald
demonstrates that, under suitable assumptions, any statistical decision
problem viewed as a continuous game may be approximated arbitrarily
closely by a finite game. This means that, even if the continuous game
is not strictly determined, no practical limitation is imposed. The
detailed procedure employed in arriving at a Minimax solution of a
statistical decision problem in this manner involves the formulation of
the problem as a game, and the solution of the game. It will not be
covered here.

3.  Summary. 

When an operations analyst is confronted with the need to make a
decision on the basis of the results of conducted trials, the accuracy
of which depends upon the true value of an unknown parameter, and the
cost of the experimentation required to estimate the value of the
parameter is significant, a statistical decision problem is indicated.
As Wald puts it, in two sentences here taken out of context,

A  statistical  decision  problem  is  formulated  with  reference 
to  a  stochastic  process  ...  A  statistical  decision  problem 
with  reference  to  a  stochastic  process  X  arises  only  when 
the  distribution    F  (x)    is  not  completely  known. 

Once a statistical decision problem has arisen, it must be possible to
specify the stochastic process, the parameter space, the space of
terminal decisions, the weight function and the cost function, in order
to solve it. A solution consists of determining the particular
statistical decision function that prescribes the optimum plan for
conducting experimentation and reaching a terminal decision.

The procedure employed to reach a solution involves the use of
a risk function as a basic criterion for selecting the optimum decision
function. This risk function takes account of both the cost of wrong
decision and the cost of experimentation. If an a priori distribution
of the unknown parameter can be given, the final yardstick for
selecting the optimum decision function is the average risk; if not,
the final yardstick is the maximum risk. In either case, the yardstick
is ordered as to magnitude, and that decision function which provides
the least value of the yardstick is selected as a solution.
The first case yields a Bayes solution; the second a Minimax solution.

The consequence of the final decision is probabilistic. This
means that the final decision may, in a particular instance, conceivably
be a poor one. Nonetheless, the theory offers a rational approach to
the type of problem it fits, and is superior to any other known approach.

APPENDIX A
SOME SELECTED MATHEMATICAL CONCEPTS

1.  Probability. 

Probability is a quantitative measure of the likelihood of the
occurrence of events. It is expressed by assigning a number in the
range [0, 1] to any specific event. For example, if an event is
certain to occur it has probability 1; if it is certain not to occur
it has probability 0. If an event has a fifty-fifty chance of occurring
it has probability $\tfrac{1}{2}$. The probability of an event may be
estimated by conducting repeated trials and employing the formula

$$\text{probability} = \frac{\text{number of successes}}{\text{number of trials}}.$$

2.  Stochastic  Variables. 

A stochastic variable may be defined to be a function which
associates a real number with every possible outcome of an experiment.
The outcome of any particular performance of the experiment is said
to be a value assumed by the stochastic variable, it being understood
that this outcome is a chance occurrence. A stochastic variable is
termed discrete if the number of distinct values which it may assume
is either finite or may be arranged in a sequence (i.e., is denumerable).
It is termed continuous if its possible values may be represented by an
interval on the real line, e.g., all the points $x$ such that
$a < x < b$ or $-\infty < x < \infty$.

3.  The  Distribution  of  a  Discrete  Stochastic  Variable. 

The correspondence between the values of a discrete stochastic
variable and the probabilities that it will take on these values may be
described either by a probability function (bar graph) or by a
cumulative probability distribution function (step function). As an
example of this, consider a single true die to be tossed a large number
of times. A mathematical description of the stochastic nature of this
experiment may be formulated as follows:

$X$: a stochastic variable representing the value shown on the die
after any throw.

$x_i$: real values which may be assumed by the stochastic variable
$X$, i.e., 1, 2, 3, 4, 5, and 6.

$G(x)$: the probability that $X$ will take on a value less than or
equal to $x$. $G(x) = \Pr(X \le x)$.

$g(x)$: the probability that $X$ will take on the value $x$,
$g(x) = \Pr(X = x)$.


These quantities may be displayed as follows:

[Figure 12. Distribution of a Discrete Stochastic Variable: bar graph of $g(x)$ and step function of $G(x)$]

The bar graph indicates that the probability of tossing any particular
number on a given throw is the same for all numbers, and is equal to
$\tfrac{1}{6}$. The step function is an alternative way of presenting
essentially the same information. It permits the probability that a
toss will show a value less than or equal to any given value to be read
directly. For instance, the probability that the die will show three or
less on a throw is

$$G(3) = \tfrac{3}{6} = \tfrac{1}{2},$$

a result that would be anticipated. It is to be noted that

$$\sum_{i=1}^{6} g(x_i) = 1 \quad \text{and} \quad g(x_i) \ge 0, \quad i = 1, 2, 3, 4, 5, 6.$$

Also,

$$G(0) = 0 \quad \text{and} \quad G(6) = 1.$$

These are fundamental relations associated with the probability
function and cumulative probability distribution function of the
stochastic variable $X$.
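
The relations for the die can be verified directly. The short Python sketch below is an added illustration; the names `values`, `g`, and `G` simply mirror the notation of this section.

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]


def g(x):
    """Probability function of a true die: Pr(X = x)."""
    return Fraction(1, 6) if x in values else Fraction(0)


def G(x):
    """Cumulative probability distribution function: Pr(X <= x)."""
    return sum(g(v) for v in values if v <= x)


print(G(3))  # 1/2
```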

4.  The  Distribution  of  a  Continuous  Stochastic  Variable. 

The correspondence between the values of a continuous stochastic
variable and the probabilities that it will take on these values may
be described either by a probability density function or by a
cumulative probability distribution function. As an example of this,
consider a line six units long on which a point is to be chosen at
random. This is an experiment similar to the one used to describe the
distribution of a discrete stochastic variable, but now the value of
the stochastic variable may be any number in the closed interval
[0, 6]. A mathematical description of the stochastic nature of this
experiment may be formulated as follows:

$X$: a stochastic variable representing the coordinate of the point
selected on any try.

$x$: real values which the stochastic variable $X$ may assume.

$G(x)$: the probability that $X$ will take on a value less than or
equal to $x$. $G(x) = \Pr(X \le x)$.

$g(x)$: the probability density function of $X$.

$g(x)\,dx$: the probability that $X$ will take on a value between
$x$ and $x + dx$. $g(x)\,dx = \Pr(x < X < x + dx)$.

The probability density function and the cumulative probability
distribution function associated with $X$ may be displayed as follows:

[Figure 13. Distribution of a Continuous Stochastic Variable: probability density function and cumulative distribution function on the interval 0 to 6]

The particular density function of Figure 13 is said to be uniform.
This means that the stochastic variable is equally likely to take on
any one of its values and accounts for the straight, horizontal line
which represents the density function. Other stochastic variables
may have a bias such that some of the values are more likely to occur
than others, and will have density functions which are not represented
by horizontal lines. In any case, the area under the density function
will always be 1, and the cumulative distribution function will
increase monotonically to a maximum value of 1 for increasing values
of $x$. It is to be noted that

$$\int_0^6 g(x)\, dx = 1 \quad \text{and} \quad g(x) \ge 0 \quad \text{for } 0 \le x \le 6.$$

Also,

$$G(0) = 0 \quad \text{and} \quad G(6) = 1.$$

These are fundamental relations associated with the probability density
function and the cumulative probability distribution function of the
stochastic variable $X$. In texts on probability theory, it is shown
that, for continuous stochastic variables, the density function, when
it exists, is the derivative of the cumulative distribution function.
This is a basic relation. It should be noted that, whereas every
stochastic variable, $X$, has a cumulative distribution function
$G(x)$, the density function

$$g(x) = \frac{d\,G(x)}{dx}$$

exists only if $G(x)$ is differentiable.
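
For the uniform example on the interval 0 to 6 the derivative relation can be checked numerically. The Python sketch below is an added illustration; it assumes $G(x) = x/6$ on the interval, which follows from the uniform density, and uses a central difference with a hypothetical step `h` in place of the derivative.

```python
def G(x):
    """Cumulative distribution function of a point chosen at random on [0, 6]."""
    return min(max(x / 6.0, 0.0), 1.0)


def density(x, h=1e-6):
    """Central-difference estimate of dG/dx at x."""
    return (G(x + h) - G(x - h)) / (2 * h)


print(density(3.0))  # approximately 1/6
```

At the end points 0 and 6 the function $G$ has a corner, so the difference quotient there depends on the side from which it is taken.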

5.  The  Expected  Value  of  a  Stochastic  Variable. 

The expected value (average value) of a discrete stochastic variable
is defined to be

(5.1)  $E(X) = \bar{x} = \sum_{\text{all } i} x_i\, g(x_i),$

and the expected value of a continuous stochastic variable which has
a density function is defined to be

(5.2)  $E(X) = \bar{x} = \int_{\text{all } x} x\, g(x)\, dx.$

The expectation, $E(X)$, is often termed a weighted average. In the
case of a discrete stochastic variable, the "weight" associated with
$x_i$ is $g(x_i)$. In the case of a continuous stochastic variable, the
"weight" associated with $x$ is $g(x)\,dx$, i.e., the probability that
$X$ lies in $dx$. In the latter case we say, more briefly, that $x$ is
weighted by $g(x)$, the value of the probability density function.
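
Both definitions can be tried on the two examples of this appendix, the true die and the point chosen at random on the interval 0 to 6. The Python sketch below is an added illustration; the function names are assumptions, and the integral of (5.2) is approximated by a midpoint rule with a hypothetical number of subintervals `n`.

```python
from fractions import Fraction


def expected_discrete(values, g):
    """Equation (5.1): sum of x_i * g(x_i) over all values."""
    return sum(x * g(x) for x in values)


def expected_continuous(density, a, b, n=10000):
    """Equation (5.2), approximated by the midpoint rule on [a, b]."""
    width = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * width
        total += x * density(x) * width
    return total


die_mean = expected_discrete(range(1, 7), lambda x: Fraction(1, 6))
uniform_mean = expected_continuous(lambda x: 1 / 6, 0.0, 6.0)
print(die_mean, uniform_mean)  # 7/2 and approximately 3
```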
In theoretical discussions it may not be desirable to distinguish
between discrete and continuous stochastic variables nor to emphasize
continuous stochastic variables which have a density function. In such
circumstances, generally, it is only necessary to refer to the
stochastic variable, say $X$, and its cumulative probability
distribution function, say $G(x)$. To represent the expected value of
$X$, it is customary to write

(5.3)  $E(X) = \int_{-\infty}^{\infty} x\, dG(x),$

where the integral on the right represents either a Riemann-Stieltjes
or a Lebesgue-Stieltjes integral according to the degree of generality
of the theory of integration under consideration. In this paper, such
integrals will be viewed as Riemann-Stieltjes integrals. In short, we
may regard the integral of equation (5.3) as a concept which includes
both of the concepts of equations (5.1) and (5.2) as special cases. We
should note that the expected value of a function of a stochastic
variable may also be defined. For example, if $h(X)$ is a function of
the stochastic variable $X$, we may write

$$E[h(X)] = \int_{-\infty}^{\infty} h(x)\, dG(x).$$

As a further illustration of the notion of expectation and the use of
general functional notation, consider the following:

$P$: a stochastic variable

$p$: real values that may be assumed by $P$

$\xi(p)$: cumulative probability distribution function of $P$

$\xi'(p)$: probability density function of $P$.

The expected value of $P$, according to equation (5.2), is

$$E(P) = \bar{p} = \int_\Omega p\, \xi'(p)\, dp.$$

This may be more conveniently expressed, using the Riemann-Stieltjes
integral, as

$$E(P) = \int_\Omega p\, d\xi,$$

where $\Omega$ is the space of $p$, and the differential $d\xi$ is used
instead of $\xi'(p)\,dp$. Thus, values of $p$ are weighted by $d\xi$
where formerly values of $p$ were weighted by $\xi'(p)\,dp$. The symbol

$$\int_\Omega p\, d\xi$$

is a functional symbol used to express the notion of the weighting and
summing. When actual computations are carried out, $d\xi(p)$ is
replaced by its equivalent expression in $p$, and the integration is
carried out just as in elementary calculus.
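
The idea of weighting values of $p$ by increments of the cumulative distribution function can be illustrated numerically. The Python sketch below forms a Riemann-Stieltjes sum over a fine partition of the interval 0 to 1. The function name `stieltjes_mean`, the number of subintervals, and the choice of $\xi(p) = 3p^2 - 2p^3$ (the cumulative form of the density $6p - 6p^2$) are assumptions made for illustration.

```python
def stieltjes_mean(cdf, a=0.0, b=1.0, n=10000):
    """Approximate the integral of p d(xi) by summing p * (increment of xi)."""
    total = 0.0
    for k in range(n):
        left = a + (b - a) * k / n
        right = a + (b - a) * (k + 1) / n
        midpoint = 0.5 * (left + right)
        total += midpoint * (cdf(right) - cdf(left))  # p weighted by d(xi)
    return total


def xi(p):
    """Cumulative distribution function whose density is 6p - 6p**2."""
    return 3 * p**2 - 2 * p**3


print(stieltjes_mean(xi))  # approximately 0.5
```

No density is needed in the computation, only the increments of the cumulative distribution function, which is the point of the notation $d\xi$.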

6.  Joint  Distribution  Functions. 

So far we have discussed only distribution functions of a single
stochastic variable $X$. The notion of joint distribution functions of
more than one stochastic variable is often employed in probability
theory. This is nothing more than an extension of the idea of the
distribution function of a single stochastic variable. For example,
if $X_1$ and $X_2$ are two stochastic variables, then their joint
cumulative distribution function, $F(x_1, x_2)$, is the probability
that $X_1 \le x_1$ and $X_2 \le x_2$ simultaneously. That is,

$$F(x_1, x_2) = \Pr(X_1 \le x_1 \text{ and } X_2 \le x_2).$$

The density function of such a distribution may be represented as a
surface in three dimensional space as follows:

[Figure 14. Joint Density Function]

It is to be noted that

$$\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1\, dx_2 = 1 \quad \text{and} \quad f(x_1, x_2) \ge 0 \quad \text{for all } x_1, x_2.$$

Also,

$$f(x_1, x_2) = \frac{\partial^2 F(x_1, x_2)}{\partial x_1\, \partial x_2}.$$

These are fundamental relations associated with the joint distribution.
In a similar manner, the analytical notion of a joint distribution
function may be extended to any number of stochastic variables, although
the geometrical representation does not apply for more than two.

7.  Bayes  Theorem. 

Perhaps the single mathematical concept most vital to an
understanding of statistical decision theory is the Bayes theorem of
inverse probability. To explain this theorem in terms of the example of
Chapter I, let us recall that we assumed that $P$, the true percentage
success of the midget submarine in a future war, is equally
likely to have any value between 0 and 100%. This is equivalent to
assuming that the a priori distribution of the parameter is uniform and,
at the outset, represents our best knowledge of $P$. As the problem
progresses, observations are made. These observations add to our
knowledge of $P$, and we therefore wish to modify the originally
assumed a priori distribution of $P$ to what we term an a posteriori
distribution of $P$ on the basis of the observations. Bayes theorem
provides the means to do this. That is, if an a priori distribution is
known and observations are subsequently made, Bayes theorem may
be used to modify the a priori distribution to an a posteriori
distribution on the basis of the observations. Two forms of Bayes
theorem in its application to density functions in statistical decision
theory are:
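Case I ($\Omega$ discrete):

$$\xi_m(p_j) = \frac{\xi(p_j) \prod_{i=1}^{m} g(x_i \mid p_j)}{\sum_{k=1}^{n} \xi(p_k) \prod_{i=1}^{m} g(x_i \mid p_k)}$$

Case II ($\Omega$ continuous):

$$\xi_m(p) = \frac{\xi(p) \prod_{i=1}^{m} g(x_i \mid p)}{\int_{\Omega} \xi(p) \prod_{i=1}^{m} g(x_i \mid p)\, dp}$$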


In these formulas, $g(x_i \mid p)$ is a probability function when $X_i$
is discrete and a probability density function when $X_i$ is
continuous, $m$ is the number of observations, and the integer $n$ is
the number of values $P$ may assume in Case I.
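
The Case I formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original development: the function name `discrete_posterior`, the three candidate values of $P$, and the sample observations are all assumptions chosen for the example, and each observation is taken to be 0 or 1 with $g(1 \mid p) = p$ and $g(0 \mid p) = 1 - p$.

```python
from fractions import Fraction
from math import prod


def discrete_posterior(prior, observations):
    """Case I of Bayes theorem for 0/1 observations.

    prior        -- dict mapping each candidate value p_j to xi(p_j)
    observations -- list of observed values x_i, each 0 or 1
    Assumes g(1 | p) = p and g(0 | p) = 1 - p (illustrative choice).
    """
    def likelihood(p):
        # Product over i of g(x_i | p)
        return prod(p if x == 1 else 1 - p for x in observations)

    # Numerator of the formula for every candidate value p_j
    weights = {p: w * likelihood(p) for p, w in prior.items()}
    # Denominator: the sum of the numerators over all n values
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Hypothetical a priori distribution: three equally likely values of P.
values = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {p: Fraction(1, 3) for p in values}

# Hypothetical sample: two successes followed by one failure.
posterior = discrete_posterior(prior, [1, 1, 0])
for p, w in posterior.items():
    print(f"xi_m({p}) = {w}")
```

With these assumed inputs the a posteriori weights are 3/20, 2/5, and 9/20 for $p$ = 1/4, 1/2, and 3/4. Applying the function one observation at a time, using each a posteriori distribution as the next a priori distribution, gives the same result as a single application to all the observations.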
To examine the theorem further, let us study an example which
finds application in this paper. Suppose

$X_i$ ($i = 1, 2, \ldots$): a collection of independently and identically
    distributed discrete stochastic variables.

$x_i$ ($i = 1, 2, \ldots$): real values that may be assumed by each $X_i$.
    Each $x_i$ is confined to the two values 0 or 1.

$G(x)$: the common cumulative probability distribution (step) function
    applicable to each of the $X_i$.

$g(x)$: the common probability function (bar graph) applicable to each
    of the $X_i$.

$P$: a continuous stochastic variable representing the parameter of
    $G(x)$ or $g(x)$.

$p$: real values that may be assumed by $P$, i.e. $0 \le p \le 1$.

$\Xi(p)$: the a priori cumulative probability distribution function
    of $P$.

$\xi(p)$: the a priori probability density function of $P$.

$\Xi_m(p)$: the a posteriori cumulative distribution function of $P$
    after $m$ observations on the $X_i$.

$\xi_m(p)$: the a posteriori density function of $P$ after $m$
    observations on the $X_i$.

The Bayes formula for the density functions of $P$ is, as in Case II,

$$\xi_m(p) = \frac{\xi(p) \prod_{i=1}^{m} g(x_i \mid p)}{\int_{\Omega} \xi(p) \prod_{i=1}^{m} g(x_i \mid p)\, dp}$$

where $\Omega$ is the space of $P$. Let us amplify this with some
diagrams and sample computations. Suppose each $X_i$ is distributed
according to the following diagram:

[Figure 15. Distribution of the $X_i$: probability $1 - p$ at $x = 0$ and probability $p$ at $x = 1$.]

The letter $p$ stands for a value of the unknown parameter. If we
assume the a priori density function of $P$ to be uniform, it may be
pictured as follows:


[Figure 16. A Priori Density Function of $P$: $\xi(p) = 1$ for $0 \le p \le 1$.]

If we take a single observation on one of the $X_i$ with the result
$x_1 = 1$, we may apply Bayes formula as follows:

$$\xi_1(p) = \frac{(1)(p)}{\int_0^1 (1)\, p\, dp} = 2p .$$


This a posteriori density function may be pictured as follows:

[Figure 17. A Posteriori Density Function of $P$ for $x_1 = 1$.]

Note that the result of the single observation, through the Bayes
formula, has modified the density function of $P$ from uniform to a
bias in favor of the value 1. This is an intuitively reasonable result,
since the value $x_1 = 1$ was observed. Similarly, if the result of the
single observation had been $x_1 = 0$, the a posteriori density function
would have been modified from uniform to a bias in favor of the value
0. In that case,


$$\xi_1(p) = \frac{(1)(1 - p)}{\int_0^1 (1)(1 - p)\, dp} = 2 - 2p ,$$


and  the  a  posteriori  density  function  becomes  the  following: 


[Figure 18. A Posteriori Density Function of $P$ for $x_1 = 0$.]

Again, if two observations had been taken with the results $x_1 = 1$ and
$x_2 = 0$, then the a posteriori density function would be

$$\xi_2(p) = \frac{(1)(p)(1 - p)}{\int_0^1 (1)(p)(1 - p)\, dp} = 6p - 6p^2$$


and  is  pictured  as  follows: 


[Figure 19. A Posteriori Density Function of $P$ for $x_1 = 1$, $x_2 = 0$.]

Note that this last density function is a parabola, and has been
modified from uniform to a bias in favor of the value $\frac{1}{2}$.
This is again an intuitively reasonable result to follow from the two
observations.
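
The three sample computations can be checked mechanically. The sketch below is an illustration only: the function name `posterior_coefficients` and the list-of-coefficients representation are assumptions, and it covers only the uniform a priori density on $0 \le p \le 1$ with observations confined to 0 or 1. Under those assumptions the numerator of the Case II formula is a polynomial in $p$, so the integral in the denominator can be taken term by term with exact fractions.

```python
from fractions import Fraction


def posterior_coefficients(observations):
    """Coefficients c[k] of the a posteriori density sum_k c[k] * p**k.

    Assumes a uniform a priori density xi(p) = 1 on [0, 1] and
    observations of 0 or 1 with g(1 | p) = p and g(0 | p) = 1 - p.
    """
    poly = [Fraction(1)]  # the uniform a priori density, xi(p) = 1
    for x in observations:
        shifted = [Fraction(0)] + poly  # coefficients of poly * p
        if x == 1:
            poly = shifted  # multiply by g(1 | p) = p
        else:
            padded = poly + [Fraction(0)]
            # multiply by g(0 | p) = 1 - p, i.e. poly - poly * p
            poly = [a - b for a, b in zip(padded, shifted)]
    # Integral over [0, 1] of sum_k c[k] p**k is sum_k c[k] / (k + 1)
    norm = sum(c / (k + 1) for k, c in enumerate(poly))
    return [c / norm for c in poly]


print(posterior_coefficients([1]))     # coefficients of 2p
print(posterior_coefficients([0]))     # coefficients of 2 - 2p
print(posterior_coefficients([1, 0]))  # coefficients of 6p - 6p^2
```

Each returned list is read from the constant term upward, so the coefficients 0, 6, -6 for the observations 1 and 0 correspond to $6p - 6p^2$, in agreement with the density computed above.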